ABSTRACT

Latent Dirichlet Allocation (LDA) is a scheme which may be used to estimate
topics and their probabilities within a corpus of text data. The fundamental
assumptions in this scheme are that text is a realisation of a stochastic
generative model and that this model is well described by the combination of
multinomial probability distributions and Dirichlet probability distributions.
Various means can be used to solve the Bayesian estimation task arising in
LDA. Our formulations of LDA are applied to subject matter expert text data
elicited through carefully constructed decision support workshops. In the main
these workshops address substantial problems in Australian Defence Capability.
The application of LDA here is motivated by a need to provide insights
into the collected text, which is often voluminous and complex in form.
Additional investigations described in this report concern questions of identifying
and quantifying differences between stake-holder group text written to a common
subject matter. Sentiment scores and key-phrase estimators are used to
indicate stake-holder differences. Some examples are provided using unclassified
data.

Numerical Algorithms for the Analysis of Expert Opinions Elicited in Text Format

Executive  Summary 

This report describes the motivation, scope and outcomes of a recent Defence Science
and Technology Organisation (DSTO) research collaboration with Industry, intended to
develop a specialised computer-based text analysis capability. In March of 2011, a formal
research agreement was struck between the Joint Operations Division (JOD) and National
ICT Australia (NICTA). Fundamentally this agreement was aimed at developing a
specific text analysis capability, with particular emphasis placed upon examining collections
of text-format expert opinions, each of which concerned a given defence capability
issue. Here the term text analysis might include identifying a finite number of key topics
in a text corpus and their relative weightings, or computing some quantitative measure of
difference between stake-holder group opinions on a specific common issue.

The primary motivation for this work is derived from text-data volume and processing
issues arising in the Joint Decision Support Centre (JDSC). The JDSC was established in
March of 2006 and is one component of a unique collaboration between DSTO and the
Capability Development Group (CDG). Part of the JDSC’s core program of work concerns
providing decision support to current projects listed in the Defence Capability Plan (DCP).
The common vehicle for this support is a facilitated defence capability workshop. Such
workshops typically run for 2-4 days and may include up to 40 attendees, consisting of
technical subject matter experts (SMEs), Australian Defence Force (ADF) staff and
representatives from various stake-holder groups. These workshops are carefully designed
to address specific defence questions and to elicit, record and analyse expert opinions.
It is important to understand that the JDSC’s scope here best ‘approximates’ what is
known as the Expert Problem, as it is described in the taxonomy due to French [Fre85].
Briefly, the Expert Problem is defined as follows:

Definition  0.1  (French,  1985).  A  group  of  experts  are  asked  for  advice  by  a  Decision 
Maker  (DM)  who  faces  a  specific  real  decision  problem.  The  DM  is,  or  can  be  taken  to 
be,  outside  the  group.  The  DM  takes  responsibility  and  accountability  for  the  consequences 
of  the  decision.  The  experts  are  free  from  such  responsibility  and  accountability.  In  this 
context  the  emphasis  is  on  the  DM  learning  from  the  experts. 

In our context the relevance of French’s definition is primarily expressed in its last
sentence, emphasising the DM learning from experts. Consequently, JDSC
decision support workshops are orientated towards informing Defence Decision Makers
through workshops and their outcomes. JDSC workshop data are generally of two classes:
1) numeric data, such as voting scores or quantitative preference rankings, or 2) text data
collected through network-based text collection software. The text data collected at JDSC
workshops is usually rich in content and large in volume. Ideally, this data should be
analysed both in situ, that is during a given workshop, and off-line post-workshop. The
main tasks here are data reduction and visualisation, that is, to compute an accessible
summary visualisation of the valuable information inherent in a corpus of text that is
likely to inform Defence Decision Makers. It should also be noted that while the motivation
for this project originated from the inherent needs of JDSC workshops, the outcomes of this
project are not limited to JDSC related activities.

The main outcomes detailed in this report concern the development and capabilities
of a set of text analysis algorithms intended to support and enhance the various tasks
described above. Specific capabilities detailed here are:

•  Probabilistic  Topic  Analysis:  Topic  analysis  concerns  identifying  a  finite  number 
of  topics  within  a  corpus  of  text  and  subsequently  estimating  a  level  of  association 
of  document  elements  (such  as  words  or  phrases)  to  each  of  these  topics. 

•  Differential  Analysis:  Differential  analysis  concerns  identifying  and  quantifying 
the  differences  between  subsets  of  text,  where  set  membership  is  by  affiliation  to 
a  specific  stake-holder  group.  For  example,  what  might  be  the  differences  between 
text  data  generated  by  ARMY  SMEs  and  Air  Force  SMEs  on  a  common  defence 
capability  issue?  Further,  how  might  such  differences  be  computed  and  analysed? 

•  Key-Phrase  Analysis:  Key  phrase  analysis  concerns  identifying  and  ranking  the 
top  N  phrases  in  a  document,  either  for  the  complete  document  or  subsets  of  text 
attributed  to  various  stake-holders. 

This  report  also  contains  technical  detail  on  mathematical  foundations  of  the  work  and 
specific  details  on  some  algorithmic  issues  inherent  in  its  complex  estimation  tasks.  Finally, 
an  example  of  the  algorithms  at  work  on  an  unclassified  text  data  set  is  provided.  This 
text  data  was  collected  at  a  special  JDSC  workshop  including  two  groups  only,  DSTO 
staff  and  NICTA  staff.  Primarily  this  unclassified  data  is  included  to  demonstrate  graphic 
visualisations  of  the  three  aforementioned  core  tasks. 

1 Introduction

Modern algorithmic text analysis includes vast areas of research and application. Much
of the momentum in these areas has arisen from the enormous changes over
the last 20 years or so in the way (text) information is collected, digitised, stored and made
available through media such as the Internet. The many current domains of research and
application in text analysis are far too numerous to mention in this report (the interested
reader might consider [Ber04], [FNR03]). Instead we restrict our attention to a specific
defence task in text analysis, that is, supporting Joint Decision Support Centre workshops
by providing a means to graphically depict and summarise certain information contained
in a corpus¹ of specially elicited text.

What we would like to do is examine a corpus of text collected through the JDSC
workshop model. In particular we would like an estimated topic map for a given corpus
of text, showing subsets of words associated to a given topic and the estimated probability
of these associations. Further analysis detailed in this report will consider differential
text analysis on a collection of text and the identification and ranking of key phrases.
The need for a differential analysis capability arises in part from the raison d’être
motivating the JDSC, which was to ensure that defence capability questions addressed
through the JDSC are contributed to by all stake-holder groups, such as Army, Navy and
Air Force. Each of these attending stake-holder groups will submit text on a common point
of study, for example, the speculated operational value of a certain class of helicopter.
A natural task here is to examine differences (or similarities) in the text submitted by
the different stake-holder groups. We would also like to have a means by which we could
identify and rank the main phrases in a corpus of text.

The text analysis capability described above is intended to solve two text analysis
problems in the JDSC: 1) the on-line problem and 2) the off-line problem. The on-line
problem concerns developing an in situ capability which would provide the workshop
facilitation and workshop team with visibility of the elicited text through summarising
graphic depictions of that text. The intention is to provide that capability in real time
as the workshops evolve. The off-line problem refers to analysing the elicited text after
a workshop is completed. A typical 4-day workshop with 20-40 SMEs can generate a large
amount of text, and the JDSC analyst needs an effective means of analysing and reporting
on such text.

Currently there are many software products on the market that address various areas of
text mining and text analysis, most with emphasis oriented towards the analysis of news
media, or applications in business and marketing domains. However, few of these products
directly address stake-holder workshop elicited text data in a defence context. This
situation, in part, motivated a text analysis research program with an external partner.

The  DSTO/JDSC  research  agreement  with  the  National  ICT  Australia  to  develop  a 
text  analysis  capability  was  signed  by  both  parties  in  March  of  2011,  with  an  estimated 
project  duration  of  approximately  12  months.  The  deliverable  in  this  agreement  included 
a  software  capability  to  analyse  workshop  text  data  in  three  respects:  topic  estimation, 
differential analysis and key phrase estimation and ranking. NICTA is a federally funded
Australia-wide research organisation founded in 2002 and has six core research groups,
these are:

1. Computer Vision,

2. Control and Signal Processing,

3. Machine Learning,

4. Networks,

5. Optimisation,

6. Software Systems.

¹ Here the term corpus of text refers to a relatively large structured collection/set of text.


Text analysis research in NICTA is carried out predominantly by the Machine Learning
group but is not limited to any one specific NICTA research laboratory. The research
project detailed in this report engaged machine learning researchers based at the NICTA
Laboratory in Canberra, led by Prof. Wray Buntine. There are many diverse means by
which one can analyse text data; for example, stochastic or deterministic methods can
be applied. The approach taken in this project is stochastic and inherently Bayesian, and is
based upon the notion of a stochastic generative² model. The specific modelling paradigm
used here for topic estimation is Latent Dirichlet Allocation (LDA).


1.1  Aims  of  this  Report 

This report essentially describes the outcomes of a recently completed DSTO/NICTA
research contract to develop a computer-based text analysis capability. Its main aims
are:

• to describe the eliciting of SME text data through the vehicle of JDSC workshops,

• to briefly recall some basic elements/notions in text analysis,

• to describe the technical details of the estimation schemes used, including LDA and
its numerical implementation through a variant of Gibbs sampling,

• to explain the visualisations used for topic estimation, differential analysis and key
phrase estimation,

• to provide examples of the text analysis capability developed, through an unclassified
(real) text data-set collected at a JDSC workshop, and

• to provide expository appendices on some fundamental components of this work such
as Dirichlet probability distributions and Gibbs sampling.

² This class of model will be explained later.

1.2 Organisation of this Report

This report is organised as follows. In Sections 2 and 3, we briefly describe the main
functions of the JDSC and how text is elicited at the JDSC through facilitated workshops.
These sections provide background and context only. The collection of an unclassified
text data set is described in §4.

In  §5  we  recall  some  basic  notions  in  text  analysis  such  as  bag  of  words  representations 
and  text  pre-processing  tasks  such  as  the  removal  of  stop-words  etc.  We  also  provide  some 
brief  details  on  relevant  elements  of  Natural  Language  Processing. 

The core technical work in this report begins in §6, detailing the specific
theory applied for probabilistic topic modelling. This development continues
through Sections 7 and 8, covering differential analysis and key phrase analysis
respectively. In §9 we list possible extensions of the work described in this report. Finally,
a collection of appendices is provided for completeness, covering certain (less well known)
probability distributions and some basics of Gibbs sampling. In particular the Dirichlet
distribution is discussed in Appendix D. While this interesting distribution is not
widely known, it plays a fundamental role in the topic estimation results in this
report.


2  The  JDSC 


2.1  Background 

The JDSC was established in 2006 and was motivated by the need to support strategic
decision making processes concerning Defence Capability. The JDSC supports decision
processes by facilitating, enhancing and contributing to the convergence of decision making.
The primary aim of the JDSC is to provide objective, impartial, clear and timely
decision support to future defence capability decisions, including current projects listed in
the Defence Capability Plan. The DCP is defined as follows:

Definition 2.1 (DCP). The DCP outlines the Government’s long term Defence capability
plans. It is a detailed, costed 10 year plan comprising the unapproved major capital
equipment projects that aim to ensure that Defence has a balanced force that is able to achieve
the capability goals identified in the (current) Defence White Paper and subsequent
strategic updates.


2.2  Function 

The DCP is a living document listing defence capability projects; for example, SEA1000
concerns Australia’s Future Submarine and JP2060 concerns Army’s Deployed Health. The
majority of DCP projects are phased over time and are progressively reviewed by the
appropriate committees. Further, such committees may require detailed analysis on key
issues concerning a given capability. The JDSC may be tasked to provide such analysis
either through the model of a facilitated workshop, or by other means. Capability questions
examined through JDSC workshops may address, for example, the
updating/modification/rationalisation of a Defence Capability project, or the introduction
of a new Defence Capability. In most of these cases the outputs of JDSC decision support
include a written report. This support is substantiated, where feasible, by broad engagement
of stake-holders, the inclusion of subject matter experts, technical reach-back into DSTO
(and through it to the wider scientific community), in-house modelling and simulation and
scientific rigour. A graphic depiction of the JDSC facility is shown in Figure 1. Typically
JDSC workshops run for 2-4 full days and might include 20-40 specialised SMEs.

Figure 1: Graphic depiction of the JDSC Facility. Typically each participant seated
in this facility will have an individual computer terminal for text entry. In this facility
visualisation is extensively used to create context and provide information and immersion
for the workshop participants.


3 Eliciting Expert Opinions via Facilitated Workshop

The techniques for elicitation and facilitation of SME workshops addressing the Expert
Problem are many and diverse. This area is well beyond the scope of this
report; however, some comments are in order to provide context. The interested reader
might refer to the article [PAF+07], or to the special issue of the Journal of the
Operational Research Society, 2007, Number 58. These papers consider what is generally
referred to as Problem Structuring Methods (PSMs). PSMs are fundamental to the
facilitated group settings of the JDSC, where pre-workshop tasks concern detailed
examinations of the defence capability issue to be studied. Ultimately a schedule must be
generated for each workshop, including a program of questions, surveys, debate topics,
voting schemes etc., all intended to elicit expert information for the benefit of the remote
DM. This task is far more complex than it might first seem. Compounding this task is the
sheer and often elusive complexity of the problems studied at the JDSC. The unfortunate
convention is to refer to this class of problem as “Wicked Problems” (see
[Pid96b], [Pid96a]). Such problems are characterised in [PAF+07] as follows:

•  They  are  one-off  problems  that  may  have  some  similarities  with  previous  problems 
but  have  never  been  encountered  before. 

•  Solving  Wicked  Problems  may  cause  or  worsen  other  interconnected  problems. 

• There are usually many stake-holders, often holding conflicting values and
perspectives in the decision context.

• There is no right or wrong solution; there may be “solutions” that are perceived to
be good by some, but seldom by all, stake-holders.

Suffice to say, the task of eliciting expert opinions in JDSC workshops is generally difficult.
However, one can (loosely speaking) categorise typical SME responses. These responses
might be a numeric vote of preference, a binary accept/reject reply, or a text input response
offering either an opinion or a text-form answer to a specific problem. It is precisely the
analysis of these text responses that we are concerned with in this report.


3.1  Computer-based  Text  Collection  for  Sets  of  Experts 

The JDSC uses collaborative text entry software, with which participants (i.e. experts)
can enter their opinions on certain topics. The means of text entry for (co-located) JDSC
workshop participants is a computer terminal; in most cases each participant will have
their own. Each entry is date and time stamped and contains a user number, and so can be
tagged to a particular participant. Text responses can be a sentence, a paragraph or indeed
several paragraphs. An example of this data is given in Appendix F.
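
As a minimal sketch of how such tagged entries might be handled in software, the Python fragment below parses one response into a record holding the user number, the time stamp and the text. The attribution layout "Submitted by <user> (<date> <time>)" is modelled on the sample responses reproduced in Section 4; the record type `Entry` and the helper names are hypothetical and are not part of the JDSC collection software.

```python
import re
from dataclasses import dataclass
from datetime import datetime

# Assumed layout of the attribution line, modelled on the sample responses
# in Section 4: "Submitted by <user> (<YYYY-MM-DD HH:MM:SS>)".
_ATTRIBUTION = re.compile(
    r"Submitted by (\d+) \((\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\)"
)


@dataclass(frozen=True)
class Entry:
    """One text response tagged to a participant (hypothetical record type)."""
    user: int
    submitted: datetime
    text: str


def parse_entry(text: str, attribution: str) -> Entry:
    """Build an Entry from the response text and its attribution line."""
    match = _ATTRIBUTION.search(attribution)
    if match is None:
        raise ValueError(f"unrecognised attribution line: {attribution!r}")
    user, stamp = match.groups()
    submitted = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return Entry(int(user), submitted, text.strip())


def group_by_user(entries):
    """Collect the response texts submitted by each participant number."""
    grouped = {}
    for entry in entries:
        grouped.setdefault(entry.user, []).append(entry.text)
    return grouped
```

Grouping by user number is the step that would later allow responses to be assigned to a stake-holder group, given a separate (here unspecified) mapping from user numbers to groups.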


4 The Elicitation of an Unclassified Text Data Set


4.1  Background 

Given  the  JDSC  is  an  Australian  defence  facility,  much  of  its  work,  including  outputs  and 
collected  data,  is  naturally  classified  and  so  is  not  generally  available  to  external  research 
partners.  The  work  in  this  Technical  Report  is  based  upon  a  research  collaboration  with 
NICTA,  which  is  separate  to  the  Department  of  Defence.  This  motivated  a  clear  need  to 
develop  an  unclassified  text  data  set. 

On the 27th March 2011, a specially prepared text collection workshop was held at the
JDSC Canberra. This workshop involved just two stake-holder groups, JOD/DSTO staff
and NICTA research staff. The intention of this event was to generate an unclassified
text-data set, elicited through a typical JDSC workshop, and to expose the NICTA researchers
to an example of precisely how JDSC workshop text is elicited, in an unclassified setting.
While the duration of this workshop was only half a working day, it did serve the purpose
of generating a useful and accessible text-data set.

Prior to this workshop the JDSC Discovery Team³ contributed to the planning and
scheduling of the workshop. Here the foremost task was to decide upon an unclassified
subject matter suitable for the workshop, that is, a subject matter rich enough to engender
healthy engagement and produce a text corpus containing some diversity, such as different
opinions, polarity, sentiment and bias. The workshop ran over the course of roughly one
afternoon and engaged approximately 15 participants. Consequently this is not statistically
typical of JDSC workshops, which can run over 4 full days and involve up to 40 or more
participants.


4.2  Specific  Elicitation  Questions 

The three tables below list the session-wise questions which were submitted to the
participants. In general, the unclassified subject matter from which these questions were
derived concerned decision making, research and automated text analysis. These
questions were designed to (hopefully) engender vigorous discussion and illuminate sentiment
and potential differences between the two participating groups.

Table 1: Workshop Session 1 (Decision Making)

Facilitator Question                              Org. Diff.  Sentiment  Polarised  Acronyms

Q1: What are the differences between complex      ✓           X          X          X
    decision and simple decision making?

Q2: Computer-based analytical tools are less      ✓           ✓          X          ✓
    effective for decision support than a
    human analyst?

Q3: Users of decision support tools need to       ✓           ✓          X          ✓
    know the details (theoretical basis) to
    achieve effective use?

Remark 4.1. The unclassified data set generated from this JDSC activity is relatively
short. Consequently it is not statistically significant. The purpose of the data is largely to
depict software outputs based upon real data, that is, text data from a real JDSC workshop.
In Appendix F we show the data collected from this activity corresponding to Table 2.

Table 2: Workshop Session 2 (Research and Development)

Facilitator Question                              Org. Diff.  Sentiment  Polarised  Acronyms

Q4: How is the value of research best             ✓           ✓          ✓          X
    measured?

Q5: Client-based research produces limited        ✓           ✓          X          X
    outcomes?

Q6: Industry-based Research & Development         ✓           ✓          X          X
    produces more useful outcomes than
    Academia

Table 3: Workshop Session 3 (Text Analysis)

Facilitator Question                              Org. Diff.  Sentiment  Polarised  Acronyms

Q7: Human analysis of text to create              ✓           ✓          ✓          X
    structure is more powerful than
    computer-based text analysis?

Q8: Unstructured data does not help a             ✓           ✓          X          X
    decision making process, quantitative
    fact-based data is required

Q9: Other industries outside of defence           X           X          X          X
    might benefit from automated text
    analysis

To provide a brief example here, we show just two responses to Question 4 which were
elicited in Session 2. This question was: How is the value of research best measured? Two
typical responses showing participant number, date stamps and time stamps are shown
below.

o  1.1  Publications and Journal Rankings
Submitted by 19 (2011-03-24 22:14:55)

o  1.2  The number of citations both primary and secondary citations
Submitted by 14 (2011-03-24 22:15:15)

³ The JDSC Discovery Team was raised in 2010. Its primary tasks concern pre-workshop
engagement to identify and shape the specific fundamental defence capability questions
being investigated by clients and subsequently addressed in decision support workshops.
Typical outcomes of such preparation work might be: a workshop plan/strategy, a workshop
schedule and a specific sequence of questions for the workshop facilitator to elicit expert
opinions.

5  Some  Basic  Elements  of  Text  Analysis 

In  this  section  we  recall  some  basic  elements  in  modern  text  analysis  and  in  particular  text 
analysis  definitions  and  routines  used  in  the  work  reported  here.  This  material  includes 
some  basic  results  in  Natural  Language  Processing  (NLP)  and  a  brief  statement  of  the 
important  theorem  of  Bruno  de  Finetti. 


5.1  Representations  for  Text  Data 

Text data is an abstract data type and, in general, a difficult one, with significant
diversity and complexity. To proceed with any form of analysis on
text data one must first decide upon a suitable representation for text such that the text
in question can, for example, take values in a space and thereafter be examined through
pattern recognition, the application of metrics, statistical analysis etc.

There  are  numerous  representations  available  for  text,  some  deterministic  and  some 
stochastic.  To  fix  some  basic  ideas  we  start  with  a  deterministic  representation  and  make 
use  of  geometric  ideas  in  Euclidean  space  and  linear  algebra. 

5.1.1  Deterministic  Models  for  Text  Data 

The most common deterministic model for text is the vector space model due to G.
Salton⁴ (see [SM86], [SWY75]). The vector space model and its numerous variants
essentially characterise a given document by a point in Euclidean space, that is, document
$i$, written $D^i \in \mathbb{R}^p$, is defined by the vector

$$D^i = (m^i_1, m^i_2, \dots, m^i_p)'. \qquad (1)$$

Here the components $m_j$ (with the document index $i$ suppressed) might indicate weights
for the importance of terms within the document and typically would be computed by
combination of a document-importance weight and a corpus-importance weight. The
convention is to consider a local (document level) weight and a global (corpus level) weight,
that is, for some term (typically a word) labelled $j$, we compute

$$m_j = \ell_j \times g_j. \qquad (2)$$

There are numerous schemes in the literature to compute $\ell$ and $g$, see the texts [FBY92],
[Kow97]. The primary application of the vector space model has been in Information
Retrieval (IR). Put simply, an IR system matches user queries (formal statements of
information needs) to documents stored in a database. Figure 2 shows a simple example
of two documents in a vector space and a “query” in its corresponding vector form. With
this model one might define the distance from a given document to the query vector
through the angle between them, or through the Euclidean norm of their difference. As a
text analysis example one might convert all document vectors to unit vectors and consider
the pattern of points resulting on a unit-radius hypersphere. Presumably a variety of
algorithms might then be applied to classify subsets of documents/points in a corpus or to
perform cluster analysis (see [Bis06] and [Bra89]). If a corpus of documents is mapped to a
vector space through the representation at (1), then one might collect these vectors as
columns of a matrix. This is the usual starting point for matrix-based methods of text
analysis such as Latent Semantic Indexing (LSI), see [Ber04].

Figure 2: A simple example of two documents and a query in vector space form.

⁴ As an aside, there is a curious history with this work and its appearance in the published
literature, see the text [Dur04].

Remark 5.1. There are known limitations with vector space models which are well
documented in the literature. One concerns semantic sensitivity, that is, documents with similar
context/meaning but which use different vocabularies will not be associated, resulting in a
failure to conclude a match.
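
As a rough illustration of the vector space model of this section, the following Python sketch builds weighted document vectors in the spirit of (1) and (2) and ranks documents against a query by the cosine of the angle between vectors. The toy documents, the function names, and the particular choices of local weight (raw term count) and global weight (an inverse document frequency) are assumptions made for illustration; they are not taken from the schemes surveyed in [FBY92] or [Kow97].

```python
import math
from collections import Counter


def term_weights(tokens, vocabulary, global_weight):
    """Document vector as in (1): component j is local weight x global weight, as in (2).

    The local weight is taken to be the raw term count (an assumption).
    """
    counts = Counter(tokens)
    return [counts[term] * global_weight[term] for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is the zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (illustrative only, not workshop data).
docs = {
    "D1": "submarine capability review submarine".split(),
    "D2": "helicopter capability review".split(),
}
vocabulary = sorted({term for tokens in docs.values() for term in tokens})

# Assumed global weight: log(N / n_j) + 1, where n_j is the number of
# documents containing term j and N is the number of documents.
n_docs = len(docs)
global_weight = {
    term: math.log(n_docs / sum(term in tokens for tokens in docs.values())) + 1.0
    for term in vocabulary
}

vectors = {name: term_weights(tokens, vocabulary, global_weight)
           for name, tokens in docs.items()}
query = term_weights("submarine review".split(), vocabulary, global_weight)

# Documents ordered from most to least similar to the query.
ranking = sorted(vectors, key=lambda name: cosine(vectors[name], query), reverse=True)
```

Because the comparison uses only the angle, it is unaffected by document length, which is the same effect as projecting every document onto the unit hypersphere. The limitation noted in Remark 5.1 is also visible here: a query sharing no terms with a document scores zero against it, whatever its meaning.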

5.1.2  Stochastic  Models  for  Text  Data 

Stochastic models for text are perhaps better known than deterministic models. Indeed,
the stochastic processes known as Markov chains (due to Andrei Andreevich Markov, see
[HS01]) originated in Markov’s investigation into a two-state chain (vowels and
consonants) derived from Pushkin’s poetry. Moreover, it is easy to imagine how such an
investigation might be extended beyond characters to parts of speech, such as prepositions,
nouns and adjectives. If one considered English text as a sequence of such parts, then
a dependent stochastic process such as a Markov chain might easily be modelled. For
example, the probability of a transition from a preposition to a preposition would be low,
whereas the probability of a transition from an adjective to a noun would be high. An
interesting example of Markov chains applied to text is given in the article [IG01]. The
work subsequently described in this report is based entirely upon a special type of
stochastic generative model.
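To make the two-state (vowel/consonant) chain concrete, the following minimal sketch estimates first-order transition probabilities from a string by counting adjacent letter pairs. The function names and the sample sentence are illustrative assumptions, not part of the original work, and the letter "y" is treated as a consonant purely for simplicity.

```python
from collections import Counter

VOWELS = set("aeiou")  # assumption: "y" is treated as a consonant


def letter_states(text):
    """Map each alphabetic character to 'V' (vowel) or 'C' (consonant)."""
    return ["V" if ch in VOWELS else "C" for ch in text.lower() if ch.isalpha()]


def transition_matrix(states):
    """Estimate first-order transition probabilities from adjacent pairs."""
    pair_counts = Counter(zip(states, states[1:]))
    matrix = {}
    for a in ("V", "C"):
        row_total = sum(pair_counts[(a, b)] for b in ("V", "C"))
        matrix[a] = {
            b: (pair_counts[(a, b)] / row_total if row_total else 0.0)
            for b in ("V", "C")
        }
    return matrix


sample = "the probability of a transition from an adjective to a noun would be high"
P = transition_matrix(letter_states(sample))
# P["V"]["C"] is the estimated probability that a vowel is followed by a consonant.
```

The same counting scheme would apply unchanged to a sequence of part-of-speech labels, with a larger state set.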

Remark  5.2.  The  essential  point  to  note  on  stochastic  models  for  text  analysis  is  that 
a  corpus  of  documents,  (or  indeed  a  single  document),  is  taken  as  a  realisation  of  some 
class  of  stochastic  model.  This  is  a  critical  assumption  and  is  pivotal  to  understanding 
probabilistic  topic  modelling. 


5.2  Some  Basic  Definitions  in  NLP 

There are many excellent texts dealing with NLP, for example see [MS99] and [Mit04].
This is a vast subject and well beyond the scope of this report. The definitions below
have been chosen to give some indication of how our text is pre-processed before it is
analysed for topics.

Definition 5.1 (Corpus). A corpus is a structured collection of language texts that is
intended to be a rational sample of the language in question (see the interesting reference
[JT06]). A very well known corpus is the Brown corpus of the English Language, which
was compiled in the 1960s. The plural of corpus is corpora.

Definition  5.2  (Stop  Words).  A  “stop  word”  is  a  word  that  is  functional,  rather  than 
carrying  information  or  meaning.  These  words  are  sometimes  called  function  words.  Some 
examples  are  prepositions  and  conjunctions.  An  example  subset  of  typical  stop  words  in 
English  might  be  {also,  it,  go,  to,  she,  they}.  There  are  many  well  known  lists  of  stop 
words in English available electronically on the Internet, see for example
http://www.ranks.nl/resources/stopwords.html.

Definition 5.3 (Stemming). Loosely speaking, stemming concerns identifying a canonical
or irreducible component of a word that is common to all inflections of that word. For
example, consider the conjugation of German verbs and in particular those verbs known as
weak verbs. These verbs have an irreducible component called the stem. For example,
the German verb arbeiten (to work) is conjugated as shown in Table 4.

Table 4: A simple stemming example with German verbs

ich arbeite,
du arbeitest,
wir arbeiten.

In this example the stem is arbeit. A similar example in English might be: computer,
computing, computed, computation. Many algorithms for stemming are readily available
on the Internet, for example see http://tartarus.org/~martin/PorterStemmer/ and
http://www.comp.lancs.ac.uk/computing/research/stemming/.
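As a rough illustration only, the sketch below strips the longest matching suffix from a word. It is a deliberately crude stand-in and is not the Porter or Lancaster algorithm referenced above; the suffix list and the minimum stem length are assumptions chosen so that the two examples in the definition reduce to a common stem.

```python
# Hypothetical suffix list chosen for the examples above; real stemmers use
# ordered rewrite rules rather than a flat list.
SUFFIXES = ("ation", "ing", "est", "ed", "er", "en", "e")


def crude_stem(word, suffixes=SUFFIXES, min_stem=3):
    """Remove the longest matching suffix, keeping at least `min_stem` letters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


english = [crude_stem(w) for w in ("computer", "computing", "computed", "computation")]
german = [crude_stem(w) for w in ("arbeite", "arbeitest", "arbeiten")]
```

Under these assumptions all four English words reduce to "comput" and all three German forms reduce to "arbeit". A flat suffix list of this kind will over-stem many ordinary words, which is why published stemmers are considerably more elaborate.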

Definition 5.4 (Lemmatisation). Lemmatisation is (roughly) similar to stemming and
refers to an algorithmic process to identify the so-called lemma of a given word. In general
this task is somewhat more sophisticated than stemming, as it may require an analysis of
context for a specific word and also a tagging into a group, such as noun or verb, by
using a Parts of Speech (POS) algorithm. For example, a verb such as running would have
the suffix ing removed, whereas a noun in plural form such as houses would be reduced to
house.


Definition 5.5 (Named Entities & Recognition). Named Entity recognition involves
the task of identifying proper names as they relate to a set of predefined categories of
interest. This is in general not a trivial task in NLP and encounters some immediate
problems. For example, does the word June refer to a month or to a person called June?
The word Washington might be George Washington or it might be a state of the USA.
Sub might refer to submarine, and boat might also refer to submarine.

Definition 5.6 (Tokens and Tokenisation). The word Token has a variety of meanings;
in our context it means a convenient symbol to represent something of interest, usually
words. The task of Tokenisation is to represent text as a collection of tokens. Tokens may
be words or grammatical symbols. In most cases tokenisation relies upon locating word
boundaries such as the beginning or ending of a word. Tokenisation is sometimes
referred to as word segmentation.
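A minimal sketch of tokenisation by word boundaries is given below. The regular expression is an assumption made for illustration (words, optionally joined by internal hyphens or apostrophes, and single punctuation marks as separate tokens); it is not the tokeniser used in this work.

```python
import re

# Words (allowing internal hyphens/apostrophes) or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenise(text):
    """Split text into word tokens and grammatical-symbol tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenise("Australia, if necessary), protect highly-capable submarines.")
```

Here hyphenated compounds such as "highly-capable" are kept as one token, which is a design choice rather than a rule; other schemes split on the hyphen.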

Definition 5.7 (Common Frequency/Weighting Measures). In basic text analysis,
and indeed in areas of cryptography, basic frequency indicators are often used for inference
and analysis; for example, word or character counts and their distributions can be
informative. In text analysis the two most common frequency measures are the so-called
term frequency (TF) and term frequency-inverse document frequency (TF-IDF). Term
frequency is just a straight count of the number of occurrences of a given term in a given
document. For example, suppose the term in question is denoted by $t$ and the document
by $D^i$; then this quantity is written as $\mathrm{TF}_{t,D^i} \in \mathbb{N}$. Note that this is not a
corpus-level quantity. In contrast, TF-IDF includes a measure of frequency at the corpus level,
that is, it extends TF to include occurrence counts in all documents in a corpus. To label
a single corpus of $I$ documents, we write $\mathcal{D} = \{D^1, D^2, \dots, D^I\}$. The Inverse Document
Frequency (IDF) may be written,


$$\mathrm{IDF}_{t,\mathcal{D}} = \log\left\{ \frac{|\mathcal{D}|}{1 + |\{D \in \mathcal{D} \mid t \in D\}|} \right\}. \tag{3}$$


Here $|\cdot|$ denotes cardinality, and unity is added to the denominator to avoid
divide-by-zero errors in cases where a term does not appear. The most widely used weight (i.e.
the vector components $m_j$ in equation (2)) for a candidate term $t_j$ in a given document $D^i$
is


$$m_j = \mathrm{TF}_{t_j,D^i} \times \mathrm{IDF}_{t_j,\mathcal{D}}. \tag{4}$$


The NLP community uses a variety of frequency measures such as the one at (4); the
choice usually depends upon context and perhaps specific data issues. One can see from
equation (3) that the task of the IDF is to attenuate the weight of commonly used words.
Consider, for example, the word “the”, which is likely to appear in every document. This
means (assuming we disregard the unity offset in the denominator of (3)) that the quotient
is unity and hence its log is zero, giving the term zero weight, as would be expected for a
common word.
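The two measures can be sketched directly from equations (3) and (4). The tiny corpus and the function names below are hypothetical and serve only to show the arithmetic; documents are assumed to be already tokenised into lists of words.

```python
import math
from collections import Counter


def term_frequency(term, document):
    """TF: raw count of `term` in one tokenised document (a list of words)."""
    return Counter(document)[term]


def inverse_document_frequency(term, corpus):
    """IDF as in equation (3), including the unity offset in the denominator."""
    containing = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / (1 + containing))


def tf_idf(term, document, corpus):
    """Weight as in equation (4): TF multiplied by IDF."""
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)


# Hypothetical four-document corpus.
corpus = [
    ["the", "submarine", "force", "the", "fleet"],
    ["the", "workforce", "cost"],
    ["the", "submarine", "sonar"],
    ["the", "risk", "cost", "delivery"],
]
weight_sonar = tf_idf("sonar", corpus[2], corpus)  # rare term
weight_the = tf_idf("the", corpus[0], corpus)      # term present in every document
```

Note that when the unity offset is retained, a term appearing in every document receives a slightly negative weight (here $2\log(4/5)$) rather than exactly zero, which is the effect the offset has on the idealised argument above.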
Definition 5.8 (Parts of Speech (POS) Tagging). POS tagging refers to the task of
classifying individual words into parts of speech. In English (and in German) there are
eight fundamental parts of speech; these are shown in Table 5 below.

Table 5: The eight basic Parts of Speech (POS)

noun  pronoun  verb  adjective  article  adverb  preposition  conjunction

One could further refine this set by classifying adjectives as descriptive, demonstrative,
possessive or interrogative, or nouns as plural or singular, see [MS99] page 3j2. The task of
mapping  given  words  to  the  types  in  the  table  above  is  not  trivial.  This  task  is  discussed 
at  length  in  the  literature  and  there  are  numerous  algorithms  available.  Indeed  the  POS 
tagging  problem  can  be  cast  as  a  Hidden  Markov  Model  (HMM)  problem  and  solved  with 
estimation  schemes  such  as  the  Viterbi  Algorithm.  A  good  treatment  of  this  approach  is 
given  in  [MS99]. 
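To indicate how the HMM formulation works in practice, the sketch below applies the Viterbi recursion to a three-tag toy model. All probabilities, the tag set and the vocabulary are invented for illustration and have no connection to the data used in this report.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a simple HMM."""
    # best[t][tag] = (probability of the best path ending in `tag`, previous tag)
    best = [{tag: (start_p[tag] * emit_p[tag].get(words[0], 0.0), None)
             for tag in tags}]
    for word in words[1:]:
        column = {}
        for tag in tags:
            column[tag] = max(
                (best[-1][prev][0] * trans_p[prev][tag] * emit_p[tag].get(word, 0.0),
                 prev)
                for prev in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    path = [max(tags, key=lambda tag: best[-1][tag][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


# Hypothetical model parameters.
tags = ["article", "noun", "verb"]
start_p = {"article": 0.6, "noun": 0.3, "verb": 0.1}
trans_p = {
    "article": {"article": 0.05, "noun": 0.9, "verb": 0.05},
    "noun": {"article": 0.1, "noun": 0.2, "verb": 0.7},
    "verb": {"article": 0.5, "noun": 0.4, "verb": 0.1},
}
emit_p = {
    "article": {"the": 0.9, "a": 0.1},
    "noun": {"submarine": 0.4, "dives": 0.1, "fleet": 0.5},
    "verb": {"dives": 0.7, "sails": 0.3},
}
tagged = viterbi(["the", "submarine", "dives"], tags, start_p, trans_p, emit_p)
```

The word "dives" can be emitted by both the noun and the verb state; the transition probabilities are what resolve the ambiguity in favour of the verb reading here.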

Definition  5.9  (Bag  of  Words).  The  notion  of  the  so-called  Bag  of  Words  model  for  text 
is  central  to  all  the  text  analysis  methods  described  in  this  report.  In  this  model  (roughly) 
one  considers  text  as  invariant  to  word  order  and  invariant  to  grammar.  For  example, 
§5.^  from  the  2009  Australian  Defence  White  Paper 5  reads  as  follows: 

In the case of the submarine force, the Government takes the view that our
future strategic circumstances necessitate a substantially expanded submarine
fleet of 12 boats in order to sustain a force at sea large enough in a crisis
or conflict to be able to defend our approaches (including at considerable
distance from Australia, if necessary), protect and support other ADF assets, and
undertake certain strategic missions where the stealth and other operating
characteristics of highly-capable advanced submarines would be crucial. Moreover,
a larger submarine force would significantly increase the military planning
challenges faced by any adversaries, and increase the size and capabilities of the
force they would have to be prepared to commit to attack us directly, or coerce,
intimidate or otherwise employ military power against us.

Taking the above extract, we first identify (in red) all stop words and grammatical
marks, with the following result,

In the case of the submarine force , the Government takes the view that our
future strategic circumstances necessitate a substantially expanded submarine
fleet of 12 boats in order to sustain a force at sea large enough in a crisis or
conflict to be able to defend our approaches ( including at considerable distance
from Australia , if necessary ) , protect and support other ADF assets , and
undertake certain strategic missions where the stealth and other operating
characteristics of highly-capable advanced submarines would be crucial . Moreover
, a larger submarine force would significantly increase the military planning
challenges faced by any adversaries , and increase the size and capabilities of
the force they would have to be prepared to commit to attack us directly , or
coerce , intimidate or otherwise employ military power against us .

5 See the URL http://www.defence.gov.au/whitepaper/

Finally, we delete the stop words and write out the final text subset as a multiset, showing
multiplicities of repeated words,

case submarine(x4) force(x4) Government view future strategic(x2) circumstances
necessary(x2) substantially expanded fleet 12 boats order sustain sea
large crisis conflict defend approaches including considerable distance Australia
protect support ADF assets undertake certain missions stealth operating
characteristics highly-capable advanced crucial larger significantly increase(x2)
military(x2) planning challenges faced adversaries size capabilities prepared
commit attack directly coerce intimidate otherwise employ power

Here,  the  blue  text  and  its  given  multiplicities,  might  be  thought  of  as  a  type  of  automated 
speed  reading  text,  where  the  filtered  multiset  of  interest  might  be, 

{submarine(x4), force(x4), strategic(x2), necessary(x2), increase(x2), military(x2)}.

Remark 5.3. The bag of words representation should not be confused with the usual
mathematical notion of a set, as it allows repeated elements. However, this representation
may be thought of as a multiset (see http://mathworld.wolfram.com/Multiset.html),
which is a set-like object allowing multiplicity of elements.
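A multiset of this kind maps naturally onto a counter keyed by word. The sketch below is an assumption-laden illustration: the stop-word list is a tiny hypothetical one, the tokenising pattern is our own, and only the first part of the White Paper extract is used.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list for this illustration.
STOP_WORDS = {"in", "the", "of", "that", "our", "a", "to", "at", "or", "be", "and"}


def bag_of_words(text, stop_words=STOP_WORDS):
    """Return the multiset of non-stop-word tokens; word order is discarded."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


extract = (
    "In the case of the submarine force, the Government takes the view that "
    "our future strategic circumstances necessitate a substantially expanded "
    "submarine fleet of 12 boats"
)
bag = bag_of_words(extract)
repeated = {word: count for word, count in bag.items() if count > 1}
```

Because a `Counter` compares by content only, two texts that differ solely in word order yield equal bags, which is exactly the invariance the model assumes.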

Remark 5.4. As an aside, the Bag of Words representation of text is also used in some
forms of SPAM filtering, where one “Bag” contains words in legitimate emails and another
“Bag” contains words of a potentially dubious nature.

5.3  The  Notion  of  Exchangeable  Random  Variables 

In what follows we will see that the bag of words representation has a statistical modelling
significance for topic estimation. In particular, its invariance to the ordering of words
can be expressed through the notion of exchangeability.

Theorem 5.1 (Exchangeability, Bruno de Finetti). A finite sequence of random
variables $\{X_1, X_2, \dots, X_m\}$ is said to be exchangeable if its joint probability distribution is
invariant to permutations. That is, for a permutation mapping of the indexes $\{1, 2, \dots, m\}$,
which we denote by $\pi(\cdot)$, the following equality (in probability distribution) always holds,

$$\{X_{\pi(1)}, X_{\pi(2)}, \dots, X_{\pi(m)}\} \stackrel{d}{=} \{X_1, X_2, \dots, X_m\}. \tag{5}$$

Here the equality $\stackrel{d}{=}$ means equality in probability distribution. Detailed proofs of this
interesting Theorem can be found in [CT78] and [Dur11].
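Exchangeability is weaker than independence. A standard textbook illustration, not taken from this report, is the Pólya urn: after each draw the ball is returned together with one extra ball of the same colour, so draws are dependent, yet the probability of a colour sequence depends only on how many balls of each colour it contains. The sketch below checks this numerically with exact fractions; the function names and the initial urn contents are assumptions.

```python
from fractions import Fraction
from itertools import permutations


def polya_sequence_probability(sequence, red=1, blue=1):
    """Probability of drawing `sequence` (a string of 'R'/'B') from a Polya urn."""
    probability = Fraction(1)
    for colour in sequence:
        total = red + blue
        if colour == "R":
            probability *= Fraction(red, total)
            red += 1   # return the ball plus one more red
        else:
            probability *= Fraction(blue, total)
            blue += 1  # return the ball plus one more blue
    return probability


def all_orderings_equiprobable(sequence):
    """True if every permutation of `sequence` has the same probability."""
    probabilities = {
        polya_sequence_probability("".join(p)) for p in permutations(sequence)
    }
    return len(probabilities) == 1


p_rrb = polya_sequence_probability("RRB")
p_brr = polya_sequence_probability("BRR")
```

Starting from one red and one blue ball, both orderings above have probability 1/12, even though the chance of red on the second draw changes with the outcome of the first.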


5.4  Specific  Natural  Language  Processing  Routines 

In this section we describe some NLP routines which are used to convert raw text data
into a form amenable to basic tasks of interest, for example probabilistic topic estimation.
Most of these tasks are essentially basic pre-processing; however, schemes such as acronym
detection are important in our context given the prevalence in Defence of acronyms and
of identical acronyms with different meanings.
5.4.1 Acronym Detection Scheme


In general most modern text will contain acronyms; however, text collected on a Defence
subject matter will almost surely contain acronyms. Moreover, it is not uncommon in
Defence for the same acronym to have two or more distinct meanings. Consequently we
implement a scheme to identify acronyms. This scheme is used to identify and display
acronyms and also to remove all acronyms from the Bag of Words reduction of the text.
The algorithm we use is given in Algorithm 1.


Algorithm 1 Algorithm for detecting Acronyms

for documents $i = 1$ to $I$ do
    for words $\ell = 1$ to $L^i$ do
        if word is identified as a stop word, then
            remove word
        end if
        if the word contains a character which is neither a letter nor the period, then
            remove word
        end if
        if more than half of the characters of the word are capitalised, then
            consider word as a candidate acronym
        end if
        if number of candidate acronyms in this sentence is more than half of the words
        in this sentence, then
            remove all these words from the candidate list {If a sentence is all
            capitalised its words will not be classed as acronyms}
        end if
        if word’s length is > 6 characters, then
            remove word from list of candidate acronyms
        end if
    end for
end for
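The steps of Algorithm 1 can be sketched as follows. This is our own reading of the pseudocode and not the implementation used in the project: the input layout (documents as lists of sentences, sentences as lists of words), the function name, and the decision to measure sentence length before stop-word removal are all assumptions.

```python
def detect_acronyms(documents, stop_words, max_length=6):
    """Return the set of candidate acronyms found in `documents`.

    `documents` is a list of documents; each document is a list of sentences;
    each sentence is a list of word strings.
    """
    acronyms = set()
    for document in documents:
        for sentence in document:
            candidates = []
            for word in sentence:
                if word.lower() in stop_words:
                    continue  # stop words are removed
                if not all(ch.isalpha() or ch == "." for ch in word):
                    continue  # only letters and periods are allowed
                capitals = sum(1 for ch in word if ch.isupper())
                if capitals * 2 > len(word):
                    candidates.append(word)  # more than half capitalised
            if len(candidates) * 2 > len(sentence):
                continue  # mostly capitalised sentence: discard its candidates
            acronyms.update(w for w in candidates if len(w) <= max_length)
    return acronyms


# Hypothetical input.
stop_words = {"the", "will", "and", "is", "all"}
documents = [[
    ["The", "ADF", "will", "review", "FIC", "and", "NICTA", "tools"],
    ["THIS", "SENTENCE", "IS", "ALL", "CAPITALS"],
    ["The", "UNCLASSIFIED", "report", "cites", "R2D2"],
]]
found = detect_acronyms(documents, stop_words)
```

In this example the fully capitalised sentence contributes nothing, "UNCLASSIFIED" is rejected on length, and "R2D2" is rejected because it contains digits.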


5.4.2  Named  Entity  Recognition  (NER)  Scheme 


The named entity recogniser (NER) code used in this work is publicly available through the
library at http://code.google.com/p/nicta-ner/. This library includes an Extractor,
which extracts all the first-letter-uppercase words from text and combines consecutive
words into phrases. The result set is then passed to a Named Entity Classifier. The
Classifier will score all the phrases and indicate if a phrase pertains to one of the following
three classes: 1) Location, 2) Person, or 3) Organization.
5.4.3 Named Entity (NE) Extractor

The NE Extractor is also a scheme developed by NICTA. It tokenises the text by using a
standard Java tokeniser class. This scheme is package independent and has a success rate of
the order of 75%. The NE Extractor iterates through the tokens and estimates whether the
tokens correspond to Named Entities. Special cases arise when estimating the first word
of a sentence, as this word is usually first-letter-uppercase. Further, a Word Dictionary
and some special rules are implemented to determine whether a word corresponds to a
name or not. The Word Dictionary is created utilising an open source word list project
(see http://wordlist.sourceforge.net) and a parts-of-speech tagger. In most cases the
Dictionary can be used to determine whether the first word in a sentence is a named
entity; exceptions are catered for with special rules.

An additional function of the NE Extractor is the extraction of Time/Date phrases from
text. The NE Extractor can be used to determine if a single word is likely to be a
Time/Date word (for example: 21:30, 3rd, 2001, February). If a Time/Date word is
found in the text, this “word” is added to a Date-Phrase-Model, which then iterates
to the next word in the text. This search loop terminates when any word that does not look
like a Time/Date word appears. The Date-Phrase-Model then determines whether the
tagged combination is a time/date phrase or not. For example, a phrase such as “16th” is
not a date phrase, as it is ambiguous. However, the term “16th, February” is unambiguously
a date phrase.

5.4.4  Named  Entity  (NE)  Classifier 

The NE Classifier we use is a scheme developed by NICTA which implements a numeric
scoring of classes based on several features. This classifier receives a phrase string as
input and returns a numeric score value. Generally, two classes of features are used
by this classifier. The first tests whether any key-words or key-phrases appear embedded
within the given phrase; such a feature returns “true” if the phrase contains a key-word
or key-phrase from a word list. The second type of feature extracts information from the
context, such as the attached prepositions.


6  Probabilistic  Topic  Modelling 

In general, the two basic aims of probabilistic topic modelling are: 1) to identify and rank
a finite set of topics that pervade a given corpus and 2) to annotate/tag documents within
a corpus according to the topics they concern.

Topic modelling is an increasingly useful class of techniques for analysing not only
large unstructured documents but also data that admit a “bag-of-words” assumption, such
as genomic data [FGK+05] and discrete image data [WG08]. As a promising unsupervised
learning approach with wide application areas, it has recently gained significant momentum
in the machine learning, data mining and natural language processing communities.
In this section we review fundamentals (e.g. the basic idea and posterior inference) of topic
models, especially the Latent Dirichlet Allocation (LDA) model of [BNJ03], which acts as
a benchmark model in the topic modelling community.
6.1 Probabilistic Topic Models

Probabilistic topic models [DDF+90, Hof99, Hof01, BNJ03, GK03a, BJ06, SG07, BL09,
Hei08] are a discrete analogue to principal component analysis (PCA) and independent
component analysis (ICA) that model topics at the word level within a document [Bun09].
They have many variants such as Non-negative Matrix Factorisation (NMF) [LS99],
Probabilistic Latent Semantic Indexing (PLSI) [Hof99] and Latent Dirichlet Allocation (LDA)
[BNJ03], and have applications in fields such as genetics [PSD00, FGK+05], text and the
web [WC06, BSB08], image analysis [LP05, WG08, HZ08, CFF07, WBFF09], social
networks [MWCE07, MCZZ08] and recommender systems [PG11]. A unifying treatment of
these models and their relationship to PCA and ICA is given by [BJ06].

Specifically, probabilistic topic models are a family of generative6 models for uncovering
the latent semantic structure of a corpus by using a hierarchical Bayesian analysis of the
text content. The fundamental idea is that each document is a mixture7 of latent topics,
where each topic has a probability distribution over a vocabulary of words. A topic model
is a factor model that specifies a simple probabilistic process by which documents can be
generated. It reduces the complex process of generating a document to a small number
of probabilistic steps by assuming exchangeability. While the model just described is
unrealistic as a “true” model of language generation, it is interesting enough to generate
useful semantics that we can employ in understanding a collection. Probability density
mixture models are widely used in stochastic modelling, in particular Gaussian mixtures,
for example see [MP, Mak71]. To loosely fix the basic ideas here with a crude example,
we might begin by considering the probability that a given word $w'$ is associated to one
of a finite collection of topics. This probability might be modelled as follows,

$$P(w = w') = \sum_{j=1}^{K} p(w' \mid T_j)\, p(T_j) = \sum_{j=1}^{K} \kappa_j\, p(T_j). \tag{6}$$

Here $T_j$ denotes topic $j$. The $\kappa_j$ may be thought of as weights in a convex combination of
topic-proportion probabilities.
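The sum over topics is a direct application of the law of total probability and can be evaluated in a line of code. In the sketch below the conditional probabilities for the word "FIC" and the topic probabilities are hypothetical values chosen for illustration only.

```python
def word_probability(word_given_topic, topic_probability):
    """P(w') = sum over topics j of p(w' | T_j) * p(T_j)."""
    return sum(
        word_given_topic[topic] * topic_probability[topic]
        for topic in topic_probability
    )


# Hypothetical values: p("FIC" | topic) and p(topic) for three topics.
p_fic_given_topic = {"FSR": 0.034, "FIC": 0.0, "Submarine": 0.042}
p_topic = {"FSR": 0.5, "FIC": 0.3, "Submarine": 0.2}

p_fic = word_probability(p_fic_given_topic, p_topic)
```

With these numbers the result is $0.034 \times 0.5 + 0 \times 0.3 + 0.042 \times 0.2 = 0.0254$.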

To “generate” a new document, a distribution over topics (i.e., a topic distribution) is
first drawn from a probability distribution over finite topic vectors. Then, each word in
that document is drawn from a word distribution associated with a topic drawn from the
topic distribution. The semantic properties of words and documents can be expressed in
terms of probabilistic topics.

6 In our context the term “generative model” refers to a stochastic model used to generate typical
realisations of observed data, that is, we assume there exists a stochastic model whose outputs are (loosely
speaking) a document. Such models typically have known dependencies upon hidden/latent parameters.
The usual estimation task with generative models is to apply Bayesian inference to compute/estimate
hidden variables, such as topic information. A classic example of this class of estimation task is the
celebrated Kalman Filter.

7 Here the term mixture refers to an admixture, for example concerning population modelling see the
article [SBF+11]. Some synonyms of admixture are blend, alloy and amalgamation.
In what follows we take the dimensions for a topic estimation task as defined through
three index sets $(\mathcal{K}, \mathcal{V}, \mathcal{I})$, with

$$\mathcal{K} = \{1, 2, \dots, K\}, \tag{7}$$

$$\mathcal{V} = \{1, 2, \dots, V\}, \tag{8}$$

$$\mathcal{I} = \{1, 2, \dots, I\}. \tag{9}$$

Here $K$ denotes the number of topics, $V$ the size (in number of words) of the vocabulary
and $I$ is the number of documents in the corpus being studied.

Formally, a topic model can be interpreted in terms of a mixture model as follows.
Suppose $\mu^i = (\mu^i_1, \mu^i_2, \dots, \mu^i_K)$ is a document-specific topic probability distribution, where
$i \in \{1, 2, \dots, I\}$ is an index to a specific document in the corpus. Further, $\mu^i_k \in [0, 1]$ and

$$\mu^i_k = \Pr(\text{Document } i \text{ contains/concerns Topic } k). \tag{10}$$

Suppose a real-valued $V \times K$ (Vocabulary $\times$ Topics) matrix $\Phi$ is a column-wise collection
of topic-specific word probability distributions. It is important here to note that the matrix
$\Phi$ is not document-wise; rather, it applies to the entire corpus. To fix this idea, consider
three typical topics one might find in text concerning defence capability. These might be,
for example, Force Structure Review (FSR), Fundamental Inputs to Capability (FIC8) and
the topic of Submarines. At the outset one would expect that each of these topics might
weight certain words differently. An idealised example of these differences is shown in
Table 6. Continuing, we write $z^i_\ell \in \mathcal{K}$ as an index/label to a specific topic-word association

Table 6: Typical (idealised) word-topic allocation probability vectors concerning some areas
of defence capability. Note that in this example FIC appears in two different topics, here
FSR and Submarine. Note also that the number of most-significant words per topic may
vary. The vertical ellipses indicate that the shown words are proper subsets of a greater
vocabulary which we denote by $\mathcal{V}$.

FSR                      FIC                        Submarine
cost          0.09       basing          0.031      diesel      0.136
delivery      0.034      facilities      0.027      FIC         0.042
FIC           0.034      major systems   0.019      nuclear     0.025
workforce     0.033      organisation    0.013      periscope   0.011
white paper   0.023      personnel       0.011      deterrent   0.011
risk          0.017      support         0.011      snorkel     0.009
scenarios     0.015                                 sonar       0.02
                                                    torpedo     0.012
...                      ...                        ...

8  FIC  refers  to  a  canonical  list  of  components  which  collectively  form  a  defence  capability,  for  example: 
organisation,  personnel,  facilities,  command  and  management  etc. 




for word $w^i_\ell$, where $\ell \in \{1, 2, \dots, L^i\}$ (here $L^i \in \mathbb{N}$ is the number of words in document $i$).
The basic statistical sampling process by which we assume a particular document might
have been created is as follows: for $\ell = 1, 2, \dots, L^i$,

$$z^i_\ell \sim \mu^i, \tag{11}$$

$$w^i_\ell \mid \{z^i_\ell, \Phi\} \sim F(\Phi_{\{\cdot, z^i_\ell\}}). \tag{12}$$

Here and throughout this report, the shorthand notation $a \mid (b, c) \sim d$ indicates that the
random variable $a$, given $(b, c)$, is distributed according to $d$. The function $F(\cdot)$ is set, in
general, to be a discrete distribution. Further, a Dirichlet distribution is assumed as a
prior for $\mu$. The hypothetical output of the model just described would be a document of
the following form,

$$\text{“Document”} = \{(w_1, z_1), (w_2, z_2), \dots, (w_L, z_L)\}. \tag{13}$$

Remark 6.1. It is important here to understand precisely what the stochastic generative
process just described actually means. No “real” intelligible document will ever be
generated from such a process; rather, a bag of words collection is generated with topic-word
associations according to the probability models used. This is quite distinct from some other,
better known stochastic models, for example a discrete-time Gauss-Markov system used
in a typical Kalman filtering target tracking problem. In that setting a plausible but
noise-corrupted target track is generated.


Applying standard Bayesian inference techniques, we can invert the generative process
(described by equations (11) and (12)) to infer the set of optimal latent topics that
maximise the likelihood (or the posterior probability) of a collection of documents. Compared
with a purely spatial representation (e.g., the Vector Space Model [SM86]), the advantage of
representing the content of words and documents by means of probabilistic topics is that
each topic can be individually interpreted as a probability distribution over words which
picks out a coherent cluster of correlated terms [SG07]. We should also note that each word
can appear in multiple clusters, with different probabilistic weights, which indicates that
topic models may be able to capture polysemy9 [SG07]. This generative process is based
purely on the “bag-of-words” assumption, where only word occurrence information (i.e.
frequencies) is taken into consideration. This corresponds to the assumption of exchangeability
in Bayesian Statistics (see [BS94]). However, word order is ignored even though it might
contain important contextual cues to the original content.


6.2  Latent  Dirichlet  Allocation 

Latent Dirichlet Allocation (LDA) [BNJ03] is the form of topic model considered here.
It is a fully Bayesian extension of the PLSI model: a three-level hierarchical Bayesian
model for collections of discrete data, e.g. documents. It is also known as multinomial
PCA [Bun02]. One can show that two other popular models, PLSI and NMF, are closely
related to maximum likelihood versions of LDA [GK03b, BJ06].

9 The term polysemy refers to a symbol/sign/word/phrase having a diversity of meanings. In our context
this refers to words and/or phrases which have more than one semantic meaning.
Figure 3: A graphical model representation of Latent Dirichlet Allocation. Here the only
observed data are the words, denoted by the shaded circle containing $W$.


As  a  fundamental  approach  for  topic  modelling,  LDA  is  usually  used  as  a  benchmark 
model  in  the  empirical  comparison  with  its  various  extensions  and  related  models. 

Figure 3 illustrates the graphical representation of LDA using plate notation (see [Bun94]
for an introduction; further details on graphical models can also be found in [Edw00],
[Lau96b], [CDLS99] and [KF09]). In this notation, shaded and unshaded nodes indicate
observed and unobserved (i.e. latent or hidden) random variables respectively; edges
indicate dependencies among variables; and plates indicate replication. For example, the
uppermost plate is replicated $K$ times for $K$ topics and the largest plate is replicated $I$
times over the number of documents in the corpus. For document analysis, LDA offers
a hidden variable probabilistic topic model of documents. The observed data are known
word collections

Wl  =  {wl,wZ,...,wli},  (14) 

each  corresponding  to  a  document  in  the  corpus  of  the  I  documents  being  considered.  To 
denote  all  collections  of  words  in  the  corpus,  that  is  the  set  of  sets  {w  l}i—\ :  j,  we  use  the 
shorthand  notation  W  =  w1:I,  where 

w1:I  =  {w{, . . .  . . .  ,w2 * *L2, . . .  ,w{,  (15) 

Similarly we write

$$ z^i = \{z^i_1, z^i_2, \ldots, z^i_{L^i}\} \tag{16} $$

for document i's word-topic associations (these quantities are integer-valued with $z^i_l \in \mathcal{K}$).

As above, the complete corpus collection of these associations is denoted by $Z = z^{1:I}$.
Finally we write $\mu$ to denote the collection $\{\mu^1, \mu^2, \ldots, \mu^I\}$. The two classes of hidden$^{10}$
variables we consider for each document in the corpus are:

1. The topic probability distribution, which we denote by the K-component vector $\mu^i$
(here i labels the i-th document in the corpus).

2. The word-topic assignments, which are $\{z^i_1, \ldots, z^i_{L^i}\}$.

10 Here the term “hidden” does not mean hidden as in common parlance; rather, it means a quantity that
is indirectly observed.
Further basic model parameters are the Dirichlet-prior parameters

$$ \alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_K\}, \tag{17} $$

$$ \gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_V\}. \tag{18} $$

These parameters are for topic and word distributions respectively. Note that for all
components of these parameters, $\alpha_k \in \mathbb{R}_+$ and $\gamma_v \in \mathbb{R}_+$, but the sums
$\sum_{k=1}^{K} \alpha_k$ and $\sum_{v=1}^{V} \gamma_v$ are unconstrained. Additional model parameters are
the word probability distributions (per topic), collected column-wise in the matrix $\Phi$.

We write $\mathrm{Dir}_K(\cdot)$ to indicate a K-dimensional Dirichlet distribution. The LDA model
assumes that documents are consequences of the following generative process:

1. For each topic $k \in \mathcal{K}$,

(a) choose$^{11}$ a word probability distribution according to $\Phi_{:,k} \mid \gamma \sim \mathrm{Dir}_V(\gamma)$.

2. For each document $i \in \mathcal{I}$,

(a) choose a document-specific topic probability distribution according to
$\mu^i \mid \alpha \sim \mathrm{Dir}_K(\alpha)$.

(b) For each $l \in \{1, \ldots, L^i\}$,

i. choose a topic-word association according to $z^i_l \mid \mu^i \sim \mathrm{Discrete}(\mu^i)$,

ii. choose a word according to $w^i_l \mid (z^i_l, \Phi_{:,z^i_l}) \sim \mathrm{Discrete}(\Phi_{:,z^i_l})$.
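The generative process above can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in this work: the function name `generate_corpus`, the use of NumPy, and the zero-based topic and word indices are all assumptions made for the example.

```python
import numpy as np

def generate_corpus(alpha, gamma, doc_lengths, seed=0):
    """Sample a synthetic corpus from the LDA generative process (hypothetical helper).

    alpha       : length-K Dirichlet parameter for the topic distributions.
    gamma       : length-V Dirichlet parameter for the word distributions.
    doc_lengths : list of document lengths L^i.
    Topics and words are indexed from 0 here, rather than from 1.
    """
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    K, V = alpha.size, gamma.size

    # Step 1: one word distribution per topic, stored column-wise (V x K).
    phi = rng.dirichlet(gamma, size=K).T

    mus, assignments, docs = [], [], []
    for L in doc_lengths:
        # Step 2(a): document-specific topic distribution.
        mu = rng.dirichlet(alpha)
        # Step 2(b)i: one topic per word position.
        z = rng.choice(K, size=L, p=mu)
        # Step 2(b)ii: each word is drawn from the column of phi chosen by its topic.
        w = np.array([rng.choice(V, p=phi[:, k]) for k in z], dtype=int)
        mus.append(mu)
        assignments.append(z)
        docs.append(w)
    return phi, mus, assignments, docs
```

For example, `generate_corpus([0.5, 0.5, 0.5], [0.5] * 6, [4, 7])` returns a 6 x 3 matrix of word probabilities together with two documents of lengths 4 and 7.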

Here, the hyper-parameter$^{12}$ $\gamma$ is the parameter of a Dirichlet prior on word distributions (i.e. a
Dirichlet smoothing on the multinomial parameter $\Phi$ [BNJ03]). The model parameters can be
estimated from the data. The hidden variables can be inferred for each document by
inverting the generative process; these inferred variables are useful for ad-hoc document analysis, for example
information retrieval [WC06] and document summarisation [AR08a, AR08b]. With this
process, LDA models documents on a low-dimensional topic space$^{13}$, which provides not
only an explicit semantic representation of a document, but also a hidden topic
decomposition of the document collection [BL09].

Definition 6.1 (LDA Count Frequencies). In LDA one requires the observed counts/frequencies
for the respective multinomial probability distributions. We denote these counts by

$$ n^k = (n^k_1, n^k_2, \ldots, n^k_V). \tag{19} $$

Here the components $n^k_v$ denote the number of times word $v \in \{1, 2, \ldots, V\}$ is assigned to a
specific topic $k \in \{1, \ldots, K\}$.

Similarly we write

$$ m^i = (m^i_1, m^i_2, \ldots, m^i_K). \tag{20} $$

11 Here the term “choose” means a stochastic choice, that is, sample from a probability distribution.

12 The term “hyper-parameter” derives from Bayesian statistics and refers to the parameter of a prior
probability distribution.

13 Note the number of topics associated with a document collection is usually far smaller than the
vocabulary size, since documents in a collection tend to be heterogeneous.
Here the components $m^i_k$ denote the number of word tokens in document i that are
assigned to topic k. Note also that in document i, the following sum always holds,

$$ \sum_{k=1}^{K} m^i_k = L^i. \tag{21} $$

The integer-valued quantities $n^k_v$ and $m^i_k$ are functions of the integer values $z^i_l$ and also
$w^i_l$, and so may be computed via indicator functions, that is,

$$ n^k_v = \sum_{i=1}^{I} \sum_{l=1}^{L^i} \mathbb{1}_{\{z^i_l = k\}} \, \mathbb{1}_{\{w^i_l = v\}}, \tag{22} $$

$$ m^i_k = \sum_{l=1}^{L^i} \mathbb{1}_{\{z^i_l = k\}}. \tag{23} $$
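As a small illustration of the count definitions at (19) to (23), the following Python sketch accumulates both count arrays from a set of word and topic assignments. The function name `count_statistics` and the zero-based indexing are assumptions made for this example only.

```python
import numpy as np

def count_statistics(docs, assignments, K, V):
    """Accumulate the counts n[v, k] of (22) and m[i, k] of (23) (hypothetical helper).

    docs        : list of documents, each a sequence of word indices in 0..V-1.
    assignments : list of matching sequences of topic indices in 0..K-1.
    """
    n = np.zeros((V, K), dtype=int)          # n[v, k]: word v assigned to topic k
    m = np.zeros((len(docs), K), dtype=int)  # m[i, k]: tokens of document i in topic k
    for i, (w, z) in enumerate(zip(docs, assignments)):
        for word, topic in zip(w, z):
            # Each token contributes one indicator term to both sums.
            n[word, topic] += 1
            m[i, topic] += 1
    return n, m

docs = [[0, 1, 1], [2, 0]]
assignments = [[0, 1, 1], [1, 0]]
n, m = count_statistics(docs, assignments, K=2, V=3)
```

In this toy case the row sums of `m` are 3 and 2, the two document lengths, as required by (21).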


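Given the two Dirichlet priors parametrised by $\alpha$ and $\gamma$, and the observed document
collection W, the full joint probability distribution, including the latent variables,
may be written down directly by combining the information in Figure 3
and the probability distributions given above for an LDA generative process. This joint
probability distribution, conditioned on the Dirichlet priors $\alpha$ and $\gamma$, has the form,

$$
\begin{aligned}
p(\mu, \Phi, Z, W \mid \alpha, \gamma)
&\propto \underbrace{\prod_{i=1}^{I} \prod_{l=1}^{L^i}
   \underbrace{p(z^i_l \mid \mu^i)}_{\text{Multinomial}} \,
   \underbrace{p(w^i_l \mid \Phi_{:,z^i_l})}_{\text{Multinomial}}}_{\text{likelihood}}
   \;\;
   \underbrace{\prod_{k=1}^{K} \underbrace{p(\Phi_{:,k} \mid \gamma)}_{\text{Dirichlet}} \,
   \prod_{i=1}^{I} \underbrace{p(\mu^i \mid \alpha)}_{\text{Dirichlet}}}_{\text{prior}} \\
&= \prod_{k=1}^{K} \frac{1}{\mathrm{Beta}_V(\gamma)} \prod_{v=1}^{V} \Phi_{v,k}^{\,\gamma_v + n^k_v - 1}
   \;\; \prod_{i=1}^{I} \frac{1}{\mathrm{Beta}_K(\alpha)} \prod_{k=1}^{K} (\mu^i_k)^{\alpha_k + m^i_k - 1}.
\end{aligned}
\tag{24}
$$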


In our stochastic generative model we assume the following samplings,

o the variables in the matrix $\Phi$ are corpus-level variables, which are assumed to be
sampled once for the corpus,

o the document-level variables $\mu^i$ are sampled once for each document,

o the variables $z^i_l$ are word-level variables that are sampled once per word in each
document.

We note that the probability $p(w^i_l \mid \Phi_{:,z^i_l})$ simplifies to the probability value $\Phi_{w^i_l, z^i_l}$, so the
likelihood

$$ \mathcal{L}(\mu, \Phi, Z, W \mid \alpha, \gamma) = p(\mu, \Phi, Z, W \mid \alpha, \gamma), \tag{25} $$
is itself a product of likelihood terms, as is seen by the properties of Dirichlet
distributions (see D.1). Given the observed document collection W, the basic task of Bayesian
inference is to compute the posterior probability distribution over the model parameters
$\Phi$ and the latent variables $\mu$ and Z. The posterior is

$$
p(\mu, Z, \Phi \mid W, \alpha, \gamma) =
\frac{p(\mu, Z, W, \Phi \mid \alpha, \gamma)}
     {\displaystyle \int_{\mathrm{Dom.}(\mu)} \int_{\mathrm{Dom.}(\Phi)}
      \Big\{ \sum_{\mathrm{Dom.}(Z)} p(\mu, Z, W, \Phi \mid \alpha, \gamma) \Big\} \, d\mu \, d\Phi}.
\tag{26}
$$


Although the LDA model is a relatively simple model, a direct computation of this
posterior is infeasible due to the summation over topics inside the integral of the
denominator. Further, training LDA on a large collection with millions of documents can
be challenging and efficient exact algorithms have not yet been found for such tasks, see
[Bun09]. Consequently we appeal to approximate-inference algorithms, in particular:
mean field variational inference [BNJ03], collapsed variational inference [TNW07],
expectation propagation [ML02], and Gibbs sampling [GS04]. The article [BJ06] gives
a detailed discussion of some of these methods and suggests alternatives, such as the
direct Gibbs sampling of [PSD00] and the Rao-Blackwellised Gibbs sampling of [CR96].
Furthermore, [WMM09] studied several classes of structured priors for the LDA model, i.e.
asymmetric or symmetric Dirichlet priors on $\mu$ and $\Phi$. They showed that the LDA
model with an asymmetric prior on $\mu$ significantly outperforms that with a symmetric
prior. However, there are no benefits from assuming an asymmetric prior for $\Phi$.

Out of all proposed approximate inference algorithms, each of which has advantages
and disadvantages, hereafter we focus on the collapsed Gibbs sampling algorithm
introduced in [GS04]; details can be found in [SG07]. The collapsed Gibbs sampler is found to
be as good as the others. It is also general enough to be a good base for extensions of LDA.


6.3  Numerical  Implementation 

Gibbs sampling [GG90] is a special case of the Metropolis-Hastings algorithm in the Markov
chain Monte Carlo (MCMC) family. The first collapsed Gibbs sampling algorithm for the
LDA model was proposed by [GS04] and is known as the Griffiths and Steyvers algorithm.
It marginalises out $\mu$ and $\Phi$ from Equation (26) using the standard normalising constant
for a Dirichlet. The strategy of marginalising out some hidden variables is usually referred
to as “collapsing” [Nea00], which is the same as Rao-Blackwellised Gibbs sampling [CR96].
The collapsed algorithm samples in a collapsed space, rather than sampling parameters
and hidden variables simultaneously [TNW07]. So, the Griffiths and Steyvers algorithm
is also known as a collapsed Gibbs sampler.

The principle of Gibbs sampling is to simulate a high-dimensional probability
distribution by conditionally sampling a lower-dimensional subset of variables via a Markov
chain, with the values of all the others held fixed. Essentially this means reducing a large
joint probability distribution to a collection of univariate probability distributions. The
sampling proceeds until the chain becomes stable (i.e. after the so-called “burn-in” period,
the chain will burn in to a stable local optimum). Theoretically, the probability
distribution drawn from the chain after the “burn-in” period will asymptotically approach the true
posterior distribution. In regard to the LDA model, the collapsed Gibbs sampler
considers all word tokens in a document collection, and iterates over each token to estimate the
probability of assigning the current token to each topic, conditioned on the topic assignments
of all other tokens.

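To derive the conditional distributions of interest, we first need to compute the joint
distribution over Z and W, conditioned on the hyper-parameters $\alpha$ and $\gamma$. From equation
(24) we see that

$$
\begin{aligned}
p(Z, W \mid \alpha, \gamma)
&= \int_{\mathrm{Dom.}(\Phi)} \int_{\mathrm{Dom.}(\mu)} p(Z, W, \Phi, \mu \mid \alpha, \gamma) \, d\Phi \, d\mu \\
&\propto \int \int \prod_{k=1}^{K} p(\Phi_{:,k} \mid \gamma)
   \prod_{i=1}^{I} \prod_{l=1}^{L} p(w^i_l \mid \Phi_{:,z^i_l})
   \prod_{i=1}^{I} \Big[ p(\mu^i \mid \alpha) \prod_{l=1}^{L} p(z^i_l \mid \mu^i) \Big] \, d\Phi \, d\mu \\
&= \left( \int \prod_{k=1}^{K} p(\Phi_{:,k} \mid \gamma)
   \prod_{i=1}^{I} \prod_{l=1}^{L} p(w^i_l \mid \Phi_{:,z^i_l}) \, d\Phi \right) \times
   \left( \int \prod_{i=1}^{I} \Big[ p(\mu^i \mid \alpha) \prod_{l=1}^{L} p(z^i_l \mid \mu^i) \Big] \, d\mu \right).
\end{aligned}
\tag{27}
$$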


Remark 6.2. For brevity we have assumed all documents are of the same length in respect
of number of words, that is $L^i = L$, $\forall i \in \{1, 2, \ldots, I\}$. This simplification does not affect
the end results for a corpus containing different document lengths.


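To further evaluate the RHS of equation (27) we consider the last integral (immediately
above) against $\mu$. Due to the assumptions of independence we note that

$$
\begin{aligned}
\int \prod_{i=1}^{I} \Big[ p(\mu^i \mid \alpha) \prod_{l=1}^{L} p(z^i_l \mid \mu^i) \Big] \, d\mu
&= \int \cdots \int p(\mu^1 \mid \alpha) \, p(\mu^2 \mid \alpha) \cdots p(\mu^I \mid \alpha) \; \times \\
&\qquad \prod_{l=1}^{L} p(z^1_l \mid \mu^1) \prod_{l=1}^{L} p(z^2_l \mid \mu^2) \cdots
        \prod_{l=1}^{L} p(z^I_l \mid \mu^I) \, d\mu^1 \, d\mu^2 \cdots d\mu^I \\
&= \prod_{i=1}^{I} \int p(\mu^i \mid \alpha) \prod_{l=1}^{L} p(z^i_l \mid \mu^i) \, d\mu^i.
\end{aligned}
\tag{28}
$$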


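To evaluate the previous product of integrals we need only consider the i-th integral. First
we recall the explicit form of the Dirichlet probability density, that is,

$$
\int p(\mu^i \mid \alpha) \prod_{l=1}^{L} p(z^i_l \mid \mu^i) \, d\mu^i
= \int \underbrace{\left( \frac{\Gamma\big(\sum_{k=1}^{K} \alpha_k\big)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \right)
  \prod_{k=1}^{K} (\mu^i_k)^{\alpha_k - 1}}_{\text{Dirichlet}}
  \; \prod_{l=1}^{L} p(z^i_l \mid \mu^i) \, d\mu^i.
\tag{29}
$$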


The remaining probability density in (29) is a multinomial over the word-topic association
indicators $z^i_l$. Recall that z is integer-valued and takes one of the values 1, 2, ..., K. Further,
several words in any given document may take the same z value, and the counts/frequencies
of these allocations in document i are denoted by $m^i_k$. Recalling the sum at equation (21),
we note that (ignoring normalisation constants)

$$ \prod_{l=1}^{L} p(z^i_l \mid \mu^i) \propto \prod_{k=1}^{K} (\mu^i_k)^{m^i_k}. \tag{30} $$


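Consequently the integral at (29) may be written as,

$$
\int p(\mu^i \mid \alpha) \prod_{l=1}^{L} p(z^i_l \mid \mu^i) \, d\mu^i
\propto \int \left( \frac{\Gamma\big(\sum_{k=1}^{K} \alpha_k\big)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \right)
\prod_{k=1}^{K} (\mu^i_k)^{m^i_k + \alpha_k - 1} \, d\mu^i.
\tag{31}
$$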


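It is clear that the integrand in the last term above is close to a Dirichlet
probability density, except that its normalisation term does not match the parameters in its product
term. To complete this calculation and eliminate $\mu^i$, we use the fact that the integral
of any probability density is unity, that is,

$$
\begin{aligned}
&\int \left( \frac{\Gamma\big(\sum_{k=1}^{K} \alpha_k\big)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \right)
   \prod_{k=1}^{K} (\mu^i_k)^{m^i_k + \alpha_k - 1} \, d\mu^i \\
&\quad = \left( \frac{\Gamma\big(\sum_{k=1}^{K} \alpha_k\big)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \right)
   \left( \frac{\prod_{k=1}^{K} \Gamma(m^i_k + \alpha_k)}{\Gamma\big(\sum_{k=1}^{K} (m^i_k + \alpha_k)\big)} \right)
   \underbrace{\int \left( \frac{\Gamma\big(\sum_{k=1}^{K} (m^i_k + \alpha_k)\big)}{\prod_{k=1}^{K} \Gamma(m^i_k + \alpha_k)} \right)
   \prod_{k=1}^{K} (\mu^i_k)^{m^i_k + \alpha_k - 1} \, d\mu^i}_{=\,1} \\
&\quad = \left( \frac{\Gamma\big(\sum_{k=1}^{K} \alpha_k\big)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \right)
   \left( \frac{\prod_{k=1}^{K} \Gamma(m^i_k + \alpha_k)}{\Gamma\big(\sum_{k=1}^{K} (m^i_k + \alpha_k)\big)} \right).
\end{aligned}
\tag{32}
$$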
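The closed form at (32) can be checked numerically in the simplest case K = 2, where the Dirichlet density reduces to a Beta density on [0, 1]. The sketch below is illustrative only; the function names and the particular values of the counts and of $\alpha$ are assumptions chosen for the example.

```python
from math import lgamma, exp

def log_closed_form(m, alpha):
    """Logarithm of the right-hand side of (32) for one document."""
    return (lgamma(sum(alpha)) - sum(lgamma(a) for a in alpha)
            + sum(lgamma(mk + a) for mk, a in zip(m, alpha))
            - lgamma(sum(mk + a for mk, a in zip(m, alpha))))

def numeric_integral_k2(m, alpha, steps=20_000):
    """Midpoint-rule value of the left-hand side of (32) when K = 2."""
    norm = exp(lgamma(sum(alpha)) - lgamma(alpha[0]) - lgamma(alpha[1]))
    h = 1.0 / steps
    acc = 0.0
    for s in range(steps):
        mu1 = (s + 0.5) * h  # first component; the second is 1 - mu1
        acc += mu1 ** (m[0] + alpha[0] - 1) * (1.0 - mu1) ** (m[1] + alpha[1] - 1)
    return norm * acc * h

m, alpha = (1, 2), (2.0, 3.0)   # illustrative counts and prior parameters
closed = exp(log_closed_form(m, alpha))
numeric = numeric_integral_k2(m, alpha)
```

Working in logarithms via `lgamma` avoids overflow of the Gamma function when the counts are large, which is the usual situation for real corpora.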


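Similarly one can marginalise out $\Phi$ in equation (27) via calculations similar to those
above, the result being,

$$
p(Z, W \mid \alpha, \gamma) \propto
\underbrace{\prod_{i=1}^{I} \left( \frac{\Gamma\big(\sum_{k=1}^{K} \alpha_k\big)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \right)
\left( \frac{\prod_{k=1}^{K} \Gamma(m^i_k + \alpha_k)}{\Gamma\big(\sum_{k=1}^{K} (m^i_k + \alpha_k)\big)} \right)}_{\text{over documents}}
\;\;
\underbrace{\prod_{k=1}^{K} \left( \frac{\Gamma\big(\sum_{v=1}^{V} \gamma_v\big)}{\prod_{v=1}^{V} \Gamma(\gamma_v)} \right)
\left( \frac{\prod_{v=1}^{V} \Gamma(n^k_v + \gamma_v)}{\Gamma\big(\sum_{v=1}^{V} (n^k_v + \gamma_v)\big)} \right)}_{\text{over topics}}.
\tag{33}
$$

Algorithm 2 Major cycle of Gibbs sampling for LDA

1. initialise each $z^i_l$ randomly to a topic in $\{1, \ldots, K\}$

2. for documents i = 1 to I do

3.     for words l = 1 to $L^i$ do

4.         $z^i_l \sim p(z^i_l = k \mid Z \setminus z^i_l, W, \alpha, \gamma)$

5.     end for

6. end for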


Equation (33) provides the required proportionality for the LDA collapsed Gibbs sampler,
with both quantities $\mu$ and $\Phi$ marginalised out. The relation at (33) may be used to
compute a univariate sampling density for generating realisations of Z, that is, a sampling
density of the form

$$ p(z^i_l = k \mid Z \setminus z^i_l, W, \alpha, \gamma). \tag{34} $$

Here we write a version of the set difference symbol to indicate all of Z except the element
$z^i_l$, that is

$$ Z \setminus z^i_l = \{z^1_1, z^1_2, \ldots, z^1_{L^1}, z^2_1, z^2_2, \ldots, z^2_{L^2}, \ldots, z^I_1, z^I_2, \ldots, z^I_{L^I}\} \setminus z^i_l. \tag{35} $$

Recall the main objective of the collapsed Gibbs sampler is to sample from the univariate
probability $p(z^i_l = k \mid Z \setminus z^i_l, W, \alpha, \gamma)$.

Continuing, we note the direct proportionality relationship

$$ p(z^i_l = k \mid Z \setminus z^i_l, W, \alpha, \gamma) \propto p(Z, W \mid \alpha, \gamma). \tag{36} $$


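Further, due to statistical independence we need only consider terms affected by the state
$z^i_l = k$. This is simplified by the key property of the Gamma function,
$\Gamma(n + \alpha + 1)/\Gamma(n + \alpha) = n + \alpha$. With a cancellation of factors the full conditional
distribution can be derived as

$$
p(z^i_l = k \mid z^{1:I} \setminus z^i_l, w^{1:I}, \alpha, \gamma) \propto
\frac{n^k_{w^i_l} + \gamma_{w^i_l}}{\sum_{v=1}^{V} (n^k_v + \gamma_v)} \cdot
\frac{m^i_k + \alpha_k}{\sum_{k'=1}^{K} (m^i_{k'} + \alpha_{k'})},
\tag{37}
$$

where the counts $n^k_v$ and $m^i_k$ are computed with the current token $z^i_l$ excluded.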


The Gibbs algorithm for sampling the topic assignments Z is given in Algorithm 2. After
a sufficient number of Gibbs cycles, which means the sampler has burnt in, the Markov
chain is ready to sample. Given the posterior sample statistics, the latent variable $\mu$ and
the model parameter $\Phi$ can be estimated at any one major cycle using the expectation of
the Dirichlet distribution (see §D) as:

$$ \hat{\Phi}_{v,k} = \frac{n^k_v + \gamma_v}{\sum_{v'=1}^{V} (n^k_{v'} + \gamma_{v'})} \in (0, 1], \tag{38} $$

$$ \hat{\mu}^i_k = \frac{m^i_k + \alpha_k}{\sum_{k'=1}^{K} (m^i_{k'} + \alpha_{k'})} \in (0, 1]. \tag{39} $$


Standard MCMC methodology suggests these values should be averaged across a set of
major cycles after “burn in”. An estimation strategy might then be:

1. “Burn in” for 400 cycles.

2. Continue major cycles, and at every 20-th major cycle,

(a) for k = 1, ..., K, compute the estimate $\hat{\Phi}_{:,k}$,
and for i = 1, ..., I, compute the estimate $\hat{\mu}^i$,

(b) compute running averages using these two estimates.
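The major cycle of Algorithm 2, the full conditional at (37), the estimates at (38) and (39) and the burn-in and thinning strategy above can be combined into one short Python sketch. This is a minimal illustration under stated assumptions, not the software developed in this project: the function name `collapsed_gibbs`, its arguments, the zero-based indexing and the default of ten retained samples are all choices made for this example.

```python
import numpy as np

def collapsed_gibbs(docs, K, V, alpha, gamma, burn_in=400, thin=20,
                    n_samples=10, seed=0):
    """Collapsed Gibbs sampling for LDA with running-average estimates (sketch)."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    I = len(docs)

    # Step 1 of Algorithm 2: a random initial topic for every token.
    z = [rng.integers(K, size=len(w)) for w in docs]
    n = np.zeros((V, K))   # n[v, k]: word v assigned to topic k
    m = np.zeros((I, K))   # m[i, k]: tokens of document i assigned to topic k
    for i, w in enumerate(docs):
        for l, v in enumerate(w):
            n[v, z[i][l]] += 1
            m[i, z[i][l]] += 1
    n_topic = n.sum(axis=0)  # total tokens per topic

    phi_avg, mu_avg, kept = np.zeros((V, K)), np.zeros((I, K)), 0
    for cycle in range(1, burn_in + thin * n_samples + 1):
        for i, w in enumerate(docs):
            for l, v in enumerate(w):
                k_old = z[i][l]
                # Exclude the current token from all counts.
                n[v, k_old] -= 1; m[i, k_old] -= 1; n_topic[k_old] -= 1
                # Full conditional (37); the document-side denominator does not
                # depend on k, so it is absorbed by the normalisation below.
                p = (n[v] + gamma[v]) / (n_topic + gamma.sum()) * (m[i] + alpha)
                k_new = rng.choice(K, p=p / p.sum())
                z[i][l] = k_new
                n[v, k_new] += 1; m[i, k_new] += 1; n_topic[k_new] += 1
        if cycle > burn_in and (cycle - burn_in) % thin == 0:
            # Point estimates (38) and (39) at this major cycle.
            phi = (n + gamma[:, None]) / (n_topic + gamma.sum())
            mu = (m + alpha) / (m + alpha).sum(axis=1, keepdims=True)
            kept += 1
            # Running averages over the retained cycles.
            phi_avg += (phi - phi_avg) / kept
            mu_avg += (mu - mu_avg) / kept
    return phi_avg, mu_avg
```

A call such as `collapsed_gibbs(docs, K=2, V=4, alpha=[0.5, 0.5], gamma=[0.1] * 4)` returns the averaged word-probability matrix and the averaged per-document topic probabilities. Maintaining the per-topic totals incrementally keeps each token update at a cost proportional to K.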


6.4  MCMC-based  LDA  and  the  Uniqueness  of  Outputs 

As detailed above, the basic outputs from the LDA implementation we described are
the estimated topic probabilities for each document, which we denoted by the vector $\hat{\mu}^i$.
In the previous section we described how these quantities would be estimated using a
standard collapsed Gibbs sampling scheme. Given that Gibbs sampling is a form of Markov
Chain Monte Carlo, the outputs of our LDA are themselves inherently
random. This fact immediately raises natural questions, for example: how might one best,
if at all, interpret the statistics of the LDA outputs? Does it make sense to compute
the estimated variance of these LDA outputs and somehow describe such a variance as a
measure of quality? Ideally we would like such a variance to be small, as this might offer
some confidence to the analyst that he/she is near (in some sense) to the true values.
Unfortunately, the answers to these important questions are not so simple.

The  theory  of  this  type  of  convergence  analysis  is,  as  yet,  not  exact  or  well  defined. 
Indeed,  precise  convergence  bounds  in  this  setting,  when  developed,  require  large  numbers 
of  samples.  As  an  example  of  the  practice  of  so-called  convergence  diagnostics,  see  [CC96]. 
Note  however,  this  technique  is  about  assessing  convergence  of  estimates;  it  has  some 
theoretical  justification,  but  is  based  largely  upon  heuristics. 

The origin of this problem in applying LDA to topic modelling is a basic one of
representation and modelling. Suppose we consider K possible topics. Then the LDA model
for this scenario has in fact $K! = K \times (K-1) \times (K-2) \times \cdots \times 1$ equivalent models, all of
which are indistinguishable. To put this more simply, recall the matrix $\Phi$. This matrix is
a vocabulary × topics matrix of probabilities. However there is no good reason to identify
any particular topic with any particular column in this matrix. This means we can
consider all of the K! column-wise permutations of $\Phi$ as valid. Moreover, similar arguments
apply to the ordering of the components in the vectors $\mu^i$.

Now, recalling our original concerns, suppose we run the MCMC form of LDA twice
on the same text data set and thereby generate two outputs, which will almost surely
not be the same. These outputs cannot be easily compared, as they may have
arisen from any of the K! indistinguishable models. This issue is non-trivial and remains an as
yet unsolved and important problem concerning the use of these LDA methods.
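The indistinguishability of the K! relabelled models can be illustrated numerically. The sketch below, in which the function name `log_likelihood` and all numerical values are assumptions made for the example, evaluates the log-probability of the words given $\Phi$ and $\mu$ (with the topic assignments summed out) under every permutation of the topic labels.

```python
import numpy as np
from itertools import permutations

def log_likelihood(docs, phi, mu):
    """log p(W | phi, mu), with the topic assignments summed out."""
    total = 0.0
    for i, w in enumerate(docs):
        # phi[w] has one row per token; mu[i] mixes its columns over topics.
        total += np.log(phi[w] @ mu[i]).sum()
    return total

rng = np.random.default_rng(1)
K, V = 3, 5
phi = rng.dirichlet(np.ones(V), size=K).T   # V x K word probabilities
mu = rng.dirichlet(np.ones(K), size=2)      # two documents x K topics
docs = [np.array([0, 2, 4, 4]), np.array([1, 1, 3])]

base = log_likelihood(docs, phi, mu)
# Apply the same relabelling to the columns of phi and the components of mu.
values = [log_likelihood(docs, phi[:, list(p)], mu[:, list(p)])
          for p in permutations(range(K))]
```

All 3! = 6 entries of `values` agree with `base` up to floating-point rounding, so the data alone cannot favour one labelling of the topics over another.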


6.5  Visualisation  for  Topic  Estimation 

Ultimately all estimated probabilities are of the type $\hat{\mu}(T_j \mid D^i)$. Here $T_j$ denotes topic
j and $D^i$ denotes “document” i, which may well be a text “idea/response” rather than
an entire document. These probabilities need to be projected, by some means, onto a 2-dimensional space.
For example, if the number of topics is 4, then for each $D^i$ our algorithm will compute an
estimated (and normalised) probability vector as follows,

$$
\hat{\mu}^i = \big( \hat{\mu}(T_1 \mid D^i), \hat{\mu}(T_2 \mid D^i), \hat{\mu}(T_3 \mid D^i), \hat{\mu}(T_4 \mid D^i) \big)
\in [0, 1] \times [0, 1] \times [0, 1] \times [0, 1] \subset \mathbb{R}^4.
\tag{40}
$$

Suppose we consider $K \in \mathbb{N}$ for a number of topics. To display the sites/locations of the
K topics and estimated probabilities shown in (40), we first arrange the K topic “centres”
as equidistant points on a circle, that is $C = \{(X_{T_1}, Y_{T_1}), \ldots, (X_{T_K}, Y_{T_K})\}$ denotes the
collection of “centres” appropriately chosen according to a given display. Note that these
points are not plotted on the screen, rather, they act as points of location for each topic.
Next the collection of all estimated probabilities $\hat{\mu}(T_j \mid D^i)$ is plotted relative to the
collection C. The plotting scheme used is a convex combination formulation, for example,
the 2-dimensional coordinates for the estimated probability concerning $D^i$ are computed
by,

$$ (X_{D^i}, Y_{D^i}) = \sum_{j=1}^{K} \hat{\mu}(T_j \mid D^i) \times (X_{T_j}, Y_{T_j}). \tag{41} $$

Remark 6.3. It should be clear that the projection given by (41) is an approximation and
that information will be lost when projecting from K > 2 down to 2 dimensions onto a
visualisation screen. Moreover, this projection is naturally not unique. When projecting
from K > 2 down to 2 dimensions, it is possible that distinct points in the probability
simplex may be projected to the same $(X_{D^i}, Y_{D^i})$ coordinates. In contrast, parallel
coordinate plots are unique.
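The convex-combination projection at (41), and the loss of uniqueness noted in Remark 6.3, can be sketched as follows. The function names `topic_centres` and `project`, and the choice of a unit circle, are assumptions made for this illustration.

```python
import numpy as np

def topic_centres(K, radius=1.0):
    """K equidistant points on a circle, one hypothetical 'centre' per topic."""
    angles = 2.0 * np.pi * np.arange(K) / K
    return radius * np.column_stack([np.cos(angles), np.sin(angles)])

def project(mu_hat, centres):
    """Convex combination (41): probability-weighted average of the centres."""
    return np.asarray(mu_hat, dtype=float) @ centres

centres = topic_centres(4)
# Two distinct probability vectors over K = 4 topics ...
a = project([0.5, 0.0, 0.5, 0.0], centres)
b = project([0.0, 0.5, 0.0, 0.5], centres)
# ... and a document that belongs entirely to the first topic.
corner = project([1.0, 0.0, 0.0, 0.0], centres)
```

Here `a` and `b` both land (up to rounding) at the centre of the circle even though the underlying topic probabilities are different, whereas `corner` coincides with the first topic centre.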


7  Differential  Analysis  of  Stake-Holder  Text 

In previous sections of this paper we restricted our attention to the task of topic modelling
based upon stochastic generative models and also considerable pre-processing of text, such
as the NLP routines described in §5.2 and . In differential analysis we are concerned with
a slightly different problem. Suppose a collection of two stake-holder groups assemble in
the JDSC for an investigation concerning Chinook helicopters and the use of their hoists.
Suppose the two stake-holder groups are ARMY and NAVY, both of which might make
use of these helicopters. In a typical workshop a component discussion may occur on,
say, the best operational usage of a Chinook hoist. Through networked text collection
software, participants might enter text offering subject matter expert opinions on such
a topic. With this basic example in mind, differential analysis (in the JDSC context)
concerns estimating, quantifying or illuminating the differences between two (or more)
stake-holder subgroup text responses on the same topic/question. Consequently we might
like to estimate basic notions such as bias, polarity or some measure of sentiment indicating
the “strength” of responses.

Figure 4: This figure shows a real-data example for 9 topics projected onto 2D. The
clusters of words at the topic centres list the top (probabilistically) 5 words associated to a
given topic. The details on these plots are given in section 6.5.

The notion of differential analysis of text immediately raises certain preliminary
questions, for example, what do we mean by difference? Is it semantic difference, or some
measure of frequency difference? Moreover, what exactly do we intend to compare?
Is it sentences, noun phrases, paragraphs, or entire documents? The literature has
addressed many of these questions; see the following for some recent and interesting examples:
[SDK11, CM05, LPW05, Pin04, TVV10, TP09, GST07].

7.1  Quantifying  Sentiment 

The approach we take in this work is to examine sentiment scores of the text responses
elicited through JDSC workshops. Sentiment scores are widely used in opinion mining, see
for example the well known resource SentiWordNet at the URL http://sentiwordnet.isti.cnr.it.
The history (over several generations) and details of SentiWordNet are discussed
in the articles [ES06] and [BES10]. See also [TL03].

The basic task here is to score words (numerically) as positive or negative. An
additional challenge is to infer (via sentiment) if text is subjective or objective. To score
sentiment one typically requires a lexicon. NICTA has developed its own proprietary
sentiment score lexicon, which has been applied in this work. A numeric score for a given text
idea/response/entry is determined, then converted to a colour with the obvious extrema
of strong red for extremely/max negative sentiment and strong green for extremely/max
positive sentiment. We first consider this at a word level only (in the Key Phrase section
below we also apply sentiment scores to phrases). Recall that we denote a document $D^i$
by the set,

$$ D^i = \{w^i_1, w^i_2, \ldots, w^i_{L^i}\}. \tag{42} $$

Further, write $\rho(\cdot) : \mathcal{V} \to [-1, 1]$ for a sentiment function, which maps a given word in the
lexicon to a real number in the shown range. In our sentiment analysis we consider two
types of computation: the first is a sentiment score per document; the second involves a
user query, with sentiment then computed around all instances of the given query. In the
first case we compute the sentiment score of the document, Sent.($D^i$), by a sum, that is,

$$ \mathrm{Sent.}(D^i) = \sum_{j=1}^{L^i} \rho(w^i_j). \tag{43} $$

The calculation shown at (43) is evaluated for all documents in the corpus $\{D^1, D^2, \ldots, D^I\}$.
Subsequently, each score is rescaled to $[-1, 1]$ through the normalisation Sent.($D^i$)/M,
where

$$ M = \max_{i \in \{1, 2, \ldots, I\}} \big\{ |\mathrm{Sent.}(D^i)| \big\}. \tag{44} $$

An example of this calculation is shown in Figure 4, where each dot on the simplex labels
a document and its colour indicates the computed sentiment.

If the user provides a single-word query $q \in \mathcal{V}$, then the sentiment calculation is
different. In this case document sentiment is only computed for documents containing q
one or more times. Suppose a document contains a given query three times, that is

$$ D^i = \{w^i_1, \ldots, q_{\ell_1}, \ldots, q_{\ell_2}, \ldots, q_{\ell_3}, \ldots, w^i_{L^i}\}. \tag{45} $$

Here the query word is located at the indices $\ell_1, \ell_2, \ell_3$. To compute sentiment for this
document, with respect to this query, we take a type of windowed average around each
occurrence of the query. More generally we write $0 < N^i < L^i$ for the number of matches
found in $D^i$ against the query q. The sentiment score for this document is now computed
by,

$$
\mathrm{Sent.}(D^i, q) = \frac{1}{|S|} \sum_{j \in S} \Big\{ \rho(q) + \sum_{c=1}^{C} \big\{ \rho(w^i_{j-c}) + \rho(w^i_{j+c}) \big\} \Big\}.
\tag{46}
$$

Here S is the set of indices locating the matches of q in the document and |S| is the
cardinality of this set. The window size is 2C + 1. All outcomes of the calculation shown
at (46) are normalised as above to the set $[-1, 1]$. The software developed in this project
also has a capability for multi-word queries, for example “diesel submarine” or “ARMY
Helicopter”. In this case the formulation of sentiment scoring is similar to that of (46),
however the score is computed around instances of the submitted word set. Further, if,
for example, a submitted word set is “diesel submarine”, these two words need not be
adjacent in a given document.
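The two single-word sentiment calculations, (43) with the rescaling at (44) and the windowed query score at (46), can be sketched in Python as follows. The lexicon shown is a made-up toy example and is not the NICTA lexicon; the function names, the score of zero for words absent from the lexicon, and the skipping of window positions that fall outside the document are all assumptions made for this illustration.

```python
def document_sentiment(words, lexicon):
    """Equation (43): sum of word scores. Unknown words score 0 (assumption)."""
    return sum(lexicon.get(w, 0.0) for w in words)

def query_sentiment(words, query, lexicon, C=2):
    """Equation (46): windowed score averaged over all matches of `query`."""
    matches = [j for j, w in enumerate(words) if w == query]
    if not matches:
        return None  # the document does not contain the query
    total = 0.0
    for j in matches:
        total += lexicon.get(query, 0.0)
        for c in range(1, C + 1):
            # Window positions outside the document are skipped (assumption).
            for pos in (j - c, j + c):
                if 0 <= pos < len(words):
                    total += lexicon.get(words[pos], 0.0)
    return total / len(matches)

def rescale(scores):
    """Divide by M of (44) so that every score lies in [-1, 1]."""
    M = max(abs(s) for s in scores)
    return [s / M if M > 0 else 0.0 for s in scores]

# Toy lexicon and document, for illustration only.
lexicon = {"reliable": 0.5, "poor": -0.6}
doc = "the hoist is reliable but hoist maintenance is poor".split()
doc_score = document_sentiment(doc, lexicon)
hoist_score = query_sentiment(doc, "hoist", lexicon, C=2)
```

With a window of C = 2 both occurrences of "hoist" see the word "reliable" but neither sees "poor", so the query-based score is more positive than the whole-document score.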


7.2  Visualisation  for  Sentiment  Allocations 


In this example for two stake-holder groups only, we consider data derived from the
unclassified data set collected by the JDSC in the activity described in §4. In Figure 5 we show a
single word-based score for the two groups, NICTA and DSTO. This example is included
only to illustrate the sentiment analysis for two groups. No substantial conclusions can be
drawn from this data set; it is used for illustration purposes only.


On the left and right of the plot are two vertical histograms for the top N words with
respect to a basic frequency count (per stake-holder group). These words are displayed
vertically from lowest to highest in frequency. The colours of these histograms indicate
the sentiment for how the (respective) groups used the words. Given there is no reason to
assume (in the case of two stake-holder groups) that each will contribute the same volume
of text, frequency of usage is computed relative to a group’s contribution. For example,
suppose two groups are considered, Surface Navy and Submariners, and further suppose the
Surface Navy contribute twice as much text (on a particular topic) as the Submariners
but both use the word “torpedo” in equal proportion. In this case the frequency of usage of
this word will be scored the same for both groups.


In the centre of Figure 5 we show an (x, y) graph depiction of all words corresponding
to the frequency histograms. Note that if the basic list of words contains 20 entries, then
there will be 40 corresponding (x, y) points shown on this plot, which indicate how the
same 20 words were used by both groups in respect of (jointly) frequency and sentiment.
All (x, y)-located words are tagged by colour to indicate which group they belong to. The y
component represents the sentiment value of a word, ascending from lowest (most negative
or red) to highest (most positive or green). The x coordinate for a given word shows
the relative frequency. This axis (for 2 stake-holder scenarios) is really two frequency
domains shown side by side. Tagged words in the exact centre of the plot indicate equal
usage. To give an example of how frequencies are computed, we write $f_{\mathrm{DSTO}}(w^i_l)$ and
$f_{\mathrm{NICTA}}(w^i_l)$ to denote the frequency of usage of the word $w^i_l$ by the stake-holder groups.
In the depiction of Figure 5 the stake-holder group of DSTO is shown on the left side
of the figure. Consequently, DSTO-coloured points to the left of the centre of this plot
indicate higher frequency of use by DSTO. These particular frequencies (x locations) are
computed by

$$ x(w^i_l) \triangleq \frac{f_{\mathrm{DSTO}}(w^i_l)}{f_{\mathrm{DSTO}}(w^i_l) + f_{\mathrm{NICTA}}(w^i_l)}. \tag{47} $$

With x just defined the screen coordinates for a given (DSTO) word are computed as,

$$ (x, y) = \big( x(w^i_l), \mathrm{Sent.}(w^i_l) \big). \tag{48} $$

The corresponding points for NICTA are computed similarly.
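The relative-frequency calculation at (47) can be sketched as below, reusing the torpedo example in which one group contributes twice as much text as the other. The function names and the token lists are assumptions made for this illustration, and the mapping from the ratio to an on-screen position is not modelled here.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Frequency of each word relative to the group's total contribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def x_coordinate(word, f_dsto, f_nicta):
    """Ratio (47); returns None if neither group used the word (assumption)."""
    a, b = f_dsto.get(word, 0.0), f_nicta.get(word, 0.0)
    if a + b == 0:
        return None
    return a / (a + b)

# The first group contributes twice as much text as the second, but both
# use "torpedo" in the same proportion of their own text.
dsto_tokens = ["torpedo", "range", "torpedo", "sonar"] * 2
nicta_tokens = ["torpedo", "hull", "torpedo", "crew"]
f_dsto = relative_frequencies(dsto_tokens)
f_nicta = relative_frequencies(nicta_tokens)
x_torpedo = x_coordinate("torpedo", f_dsto, f_nicta)
x_sonar = x_coordinate("sonar", f_dsto, f_nicta)
```

Because usage is measured relative to each group's own volume of text, `x_torpedo` sits at the midpoint 0.5, while a word used by only one group, such as "sonar" here, sits at an extreme of the axis.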
Figure 5: This figure shows a two stake-holder differential analysis at a word level only.

8 Key-Phrase Identification

The  common  word  “phrase”  has  a  diverse  variety  of  meanings.  For  example,  in  music 
“phrase”  has  a  clear  meaning,  it  is  a  unit  of  music  that,  if  extracted,  has  complete  musical 
sense  on  its  own.  In  text  analysis  the  notion  of  a  phrase  is  somewhat  similar  in  respect  of 
what  it  might  convey,  however,  the  classification  of  phrases  in  text  is  far  more  sophisticated 
and  developed.  In  text,  phrases  can  be  classed  in  a  variety  of  types,  for  example:  noun 
phrases,  adverbial  phrases,  adjectival  phrases  and  prepositional  phrases,  to  name  just  a 
few.  Consequently,  we  need  to  identify  which  class  of  phrases  should  be  identified  and 
analysed  in  text  data  elicited  in  a  defence  capability. 

In NLP the task of identifying/tagging phrases in text has received considerable
attention and a variety of schemes are available to address such tasks. In this work we examine
the Multi-Word-Term methods described in [FAM98] and some more basic frequency and
sentiment based identification.


8.1  Definitions  and  Basic  Theory 

The key-phrase detector we consider here is due to K. Frantzi, S. Ananiadou and H. Mima,
see [FAM98]. This paper also cites the following directly related PhD theses: [Fra98],
[Ana88] and [Lau96a]. Some further publications of relevance are [KU96], [DPL94] and
[Dun93]. A detailed exposition of the interesting results in [FAM98] is well beyond the
scope of this report; however, we give here a brief overview of its relevant components.
At the outset we note that our term “key-phrase” refers to a Multi-Word Term (MWT),
typically two or three words in length. The task of identifying such “terms” is usually
referred to as Automatic Term Recognition (ATR).

The method developed in [FAM98] is called (by its authors) the C-value/NC-value method.
Roughly, this method proposes the combination of two classes of text information to
perform ATR: 1) linguistic information (the so-called C-value) and 2) statistical
information (the so-called NC-value). Processing the linguistic information requires three
components: parts-of-speech tagging, linguistic filters and a specific list of stop words
which are not expected to occur in MWTs. The basic notion of parts-of-speech tagging was
briefly described in §5.2. The performance of each of these three components depends upon
the type of text being examined and the specific algorithms being implemented (there are
numerous schemes for parts-of-speech tagging). An interesting aspect of the
C-value/NC-value method is that it uses a type of hybrid linguistic filter. Before
describing this filter we recall the notion of regular expressions as they appear in
Computer Science.

Definition 8.1 (Regular Expressions). In computer science, regular expressions are a
special type of logical statement specifically constructed for searching and matching
purposes, such as matching given strings or given characters. More detail on regular
expressions can be found in the well known text [SW11], or at the URL
http://introcs.cs.princeton.edu/java/72regular/. Various symbols are used in regular
expressions. A brief relevant list of these symbols is shown in Table 7.

Table 7: Some regular expression symbols

Symbol    Meaning
*         zero or more occurrences
?         zero or one occurrence, that is, {0,1}
+         one or more occurrences
|         logical OR

The linguistic filter component of [FAM98] is a hybrid or combination of three types of
linguistic filters. These filters each search for matches against certain parts-of-speech
combinations or specific sequences of parts-of-speech, see, for example, [Bou92], [DC95]
and [DGL94]. Using the abbreviation Adj. for an adjective and NounPrep for a noun
preposition, and the symbols in Table 7, these three linguistic filters are written:

1. Noun+ Noun,

2. (Adj. | Noun)+ Noun,

3. ((Adj. | Noun)+ | ((Adj. | Noun)* (NounPrep)?) (Adj. | Noun)*) Noun.

To explain this further by way of example, consider the second linguistic filter above.
Recall that the symbol + means one or more occurrences and (Adj. | Noun) is an adjective
or a noun. Consequently, a match on this filter would be a sequence of the type
(Adj. | Noun), (Adj. | Noun), ..., (Adj. | Noun), Noun.

Remark 8.1. The three filters listed above are meant to indicate a successive compounding
of complexity, where 2 is a more complex version of 1 and 3 is a more complex version of
2. Ultimately the filter at 3 is the form we wish to encode for our key-phrase estimation.
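
The three filters above can be sketched with Python's re module by first encoding each part-of-speech tag as a single character. The tag names, the single-letter codes and the helper functions below are illustrative assumptions; they are not taken from [FAM98] or from the software described in this report.

```python
import re

# Hypothetical one-letter codes for the part-of-speech tags used in the filters.
TAG_CODES = {"Adj": "A", "Noun": "N", "NounPrep": "P"}

# The three linguistic filters, rewritten over the one-letter codes.
FILTERS = {
    1: re.compile(r"N+N"),                       # Noun+ Noun
    2: re.compile(r"[AN]+N"),                    # (Adj | Noun)+ Noun
    3: re.compile(r"(?:[AN]+|[AN]*P?[AN]*)N"),   # hybrid filter
}


def encode(tags):
    """Map a tag sequence to a string; any other tag becomes 'x' and cannot match."""
    return "".join(TAG_CODES.get(tag, "x") for tag in tags)


def passes_filter(tags, filter_id):
    """True if the whole tag sequence matches the chosen linguistic filter."""
    return FILTERS[filter_id].fullmatch(encode(tags)) is not None


candidate = ["Adj", "Noun", "Noun"]  # tags of an invented three-word candidate
print([fid for fid in FILTERS if passes_filter(candidate, fid)])
```

Encoding tags as characters is only a convenience; it lets an ordinary string regular expression stand in for a pattern over tag sequences.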

The second, or “statistical”, part of the C-value/NC-value algorithm concerns a
compensated frequency estimator, which deals with possible complexities/ambiguities of
substrings. The interested reader is referred to [FAM98].


8.2  Visualisation  For  Key  Phrases 

An example of key-phrase estimation for two subgroups is shown in Figure 6. In this
example the identified key-phrases (or MWTs) are ranked and their sentiment is shown by
colour. The frequency of use calculations for this plot are identical to those described
in §7.2; however, frequency is now computed from occurrences of key-phrases (or MWTs).
9 Future Work


The project described in this report represents a first effort (in the JDSC) to apply text
analysis to the task of analysing text-based opinions elicited through decision support
workshops. While the results reported here are in this sense preliminary, the outcomes of
this work have raised several further questions.


9.1 Kullback-Leibler Inter-Topic Distances

In the current implementation of the topic modelling algorithms described in this report
all topic centres are placed “equidistant” on a simplex of a given dimension. In some
sense this display is a 2D depiction of estimated probability distributions in a space of
probability distributions. Technically the distances between these topic centres (on the
display) are a type of Euclidean distance, but this is not relevant in a space of
probability distributions. Furthermore, spacing topics as “equidistant” points on a
simplex gives no indication of how close these topic centres are to each other. In a space
of probability distributions distance is usually measured by schemes such as the
Kullback-Leibler (KL) distance. As an example, recall the idealised probability
distributions in Table 6. The LDA scheme will estimate these distributions and there is no
good reason to assume they might be equidistant. A potential future investigation on this
subject might be to explore the development of a proportional display where distributions
are spaced according to a distance such as the symmetric KL distance.
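
To make the proposal concrete, the following is a minimal Python sketch of a symmetric KL distance between two discrete topic distributions. The three example distributions are invented, and the sketch assumes both distributions are defined over the same vocabulary and that the second has no zero where the first is positive (in practice some smoothing would be needed to guarantee this).

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # the term 0 * log(0 / q) is taken as 0
        if q_i == 0.0:
            raise ValueError("KL is infinite when q is 0 where p is positive")
        total += p_i * math.log(p_i / q_i)
    return total


def symmetric_kl(p, q):
    """Symmetrised form: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Invented word distributions for three topics over a three-word vocabulary.
topic_a = [0.50, 0.30, 0.20]
topic_b = [0.20, 0.30, 0.50]
topic_c = [0.45, 0.35, 0.20]

print(symmetric_kl(topic_a, topic_b), symmetric_kl(topic_a, topic_c))
```

In this toy case topic_c is much closer to topic_a than topic_b is, which is exactly the information an “equidistant” placement of topic centres discards.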


9.2  Differential  Analyses  for  Text  Subsets 

The work in this Technical Report concerning the interesting topic of differential
analyses is in its infancy. Indeed the results shown here are at best preliminary. This is
an exciting and challenging area to further explore in a defence science context. The
literature has a diverse collection of measures for differential analyses that could be
implemented, tested and further developed for defence application. An additional challenge
here, as in LDA, is the task of visualisation. Clearly the 2D visualisation task is easy
for two stake-holders. However, this is a simple case and in JDSC decision support
workshops there will be more than two stake-holder groups.


9.3  Algorithm  Property  Analyses 

The emphasis in this report is upon the development of sets of algorithms to perform
various tasks on a corpus of text data. While such outcomes hopefully provide value to
Defence analyses, they are by no means complete. What remains is to analyse these
algorithms for performance, either empirically or analytically. It would seem that
computing analytical results for the performance of topic estimation is difficult, but not
insurmountable. In addition, there are numerous empirical tests that can be conceived to
develop further intuition for these algorithms and hopefully identify limits and
ambiguities.
9.4 Statistical Significance Filtering


The current software developed for this project displays every document (for example every
text response) in the corpus of text being examined. In particular, for a JDSC workshop,
this means each individual’s response is allocated a set of probabilities estimating the
response’s relevance to a finite number of topics. In some cases this situation might
result in too much data and consequently the plots on the topic simplex are cluttered.
Ideally we would like a means of carefully filtering/classifying data, based on notions
such as the statistical significance of a text response, for example, those responses
within one standard deviation of a mean. For example, could one (via properly defined
statistics) reduce the display on the topic simplex to all those text responses within a
specified probabilistic range? Alternatively, elicited text data may contain outliers
which, although measured as statistically insignificant, might contain relevant
information. In this case one wishes to display those elicited text responses that are,
for example, 3 standard deviations from the mean.
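
One very simple reading of this idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the filter is applied to the estimated probabilities of a single chosen topic, the population standard deviation is used, and the function name and the example probabilities are hypothetical. It is not a method implemented in the software described in this report.

```python
import statistics


def split_by_deviation(values, n_std=1.0):
    """Split response indices into those within / beyond n_std standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    inside, outside = [], []
    for index, value in enumerate(values):
        if abs(value - mean) <= n_std * std:
            inside.append(index)
        else:
            outside.append(index)
    return inside, outside


# Invented probabilities that five text responses belong to one topic.
topic_probabilities = [0.10, 0.12, 0.11, 0.09, 0.90]

typical, outliers = split_by_deviation(topic_probabilities, n_std=1.0)
print("display as typical:", typical)
print("display as outliers:", outliers)
```

Either list could then drive the display: the first to declutter the simplex plot, the second to draw attention to unusual responses.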


9.5  Topic  Modelling  Visualisation  Schemes 


The outputs of probabilistic topic modelling are shown, for example, in Figure 4. Is this
the best display? Higher dimensional data can be depicted in a variety of ways; for
example, see the excellent text [KC06]. One possible project for the work reported here
might be to explore alternative displays of topic estimation, in particular parallel
coordinate plots. These plots are especially suitable to our task as the range space of
probabilities is always the compact set [0, 1]. Arguably a defence analyst examining
higher dimensional data computed by LDA would benefit from having a variety of higher
dimensional data displays.


9.6  The  Analysis  of  Historical  Data 


The JDSC has a relatively large repository of workshop text data elicited through
networked text collection software. Moreover, each of these workshop data sets has an
associated written report. These reports (in most cases) include human analysis of the
collected text, the results of which appear in the reports in various forms. Such data
will be examined retrospectively with LDA schemes (and the other forms of analysis
described in this report) and the results compared against the outcomes of human analysis.
Further, there are numerous corpora in Defence which could be similarly analysed, for
example, the sequence of Australian Defence White Papers, or the collections of so-called
Issues Papers on Force Structure Review.
10 Conclusion

10.1  Overview 

In this Technical Report we have described the evolution of a research collaboration
between DSTO and NICTA, the aim of which was to develop an algorithmic text analysis
capability. This work was motivated by the JDSC’s need to provide analysis of military SME
opinions elicited in text format. The JDSC’s modus operandi and means of text data
collection through networked software have also been described. In a rough way of
speaking, one can liken the JDSC workshop text to a type of incomplete book which must be
read and analysed, but has no contents page or index. LDA attempts to estimate “contents”
in the form of a topic map, and the estimated location (index) of topic-specific material
is given in terms of probabilities on a simplex. The second part of our text analysis work
concerned sentiment analysis and key-phrase analysis. Examples were shown on an
unclassified data set collected in a real JDSC workshop.

The work described in this report is clearly in its infancy; however, it is hoped that this
contribution might be continued and extended to further develop this capability.


10.2  Summary  of  Contributions 

The main contribution of this work was to bring to bear some modern algorithms of text
analysis on a real and ongoing problem in defence, viz., the analysis of SME opinions
concerning a given topic in defence capability. Probabilistic topic modelling via LDA is
now well understood and has been applied in a variety of settings. In our context LDA
offered a first exploratory means to estimate a topic map. LDA is one of many topic
estimation schemes and LDA itself can be implemented in different ways, for example, by
Gibbs Sampling approximations or EM algorithms. The true value of LDA in our specific
settings is yet to be determined and will form a significant part of our future work. Our
main contributions from this work are:

1. the development of a single integrated software package for the text analysis
techniques described in this report,

2. the development and implementation of a differential analysis capability using
sentiment scores,

3. the implementation of a multi-word term identifier which scores such “phrases” in
terms of sentiment,

4. the development of a diverse collection of schemes for the visualisation of text
collected from stake-holder groups writing on a common topic,

5. a computer-based capability for searching specific corpora of decision support defence
workshops.
Appendix A Conjugate Priors

A.1 Definitions

The notion of conjugate priors is an inherently Bayesian concept and was introduced
relatively recently, in 1961, by Raiffa and Schlaifer. A more recent edition of this work
may be found in [RS00]. A key idea of conjugate priors is (loosely) based on the property
of closure with respect to set membership.

Definition A.1 (Conjugate Prior). Suppose $\mathcal{F}$ is a family of probability
distributions with parameter space $\Theta$. The family $\mathcal{F}$ is said to be
conjugate for a likelihood function $\mathcal{L}(x \mid \theta) = p(x \mid \theta)$ if,
for every prior $\pi(\cdot) \in \mathcal{F}$, the posterior
$\pi(\theta \mid x) \in \mathcal{F}$.


Recall that the normalised version of Bayes’ rule is usually written

\[
\pi(\theta \mid x) = \frac{\pi(\theta)\, f(x \mid \theta)}{\int_{\Theta} \pi(\xi)\, f(x \mid \xi)\, d\xi}. \tag{A1}
\]


The problem-child in equation (A1) is the denominator on its right hand side, which
usually requires some form of numerical integration when the integrand is complex in form,
or multidimensional, or both. However, this denominator does not depend on $\theta$ (it is
a normalising constant whose integrand is identical in form to the numerator), and so
Bayes’ rule is more commonly written in its proportionality form,

\[
\pi(\theta \mid x) \propto \pi(\theta)\, f(x \mid \theta). \tag{A2}
\]

The importance/value of conjugate priors is essentially based on computational issues.
If the collection $\mathcal{F}$ is parametrised, then using conjugate priors will mean
that updating/switching from a prior to a posterior distribution is just a matter of
updating a finite set of parameters. Moreover, the possibility of numerical integrations
may be avoided by using conjugate priors.


A.2 Example

Remark A.1 (Example). The celebrated Kalman filter introduced by Rudolf Kalman
(see [Kal60]) offers a classic illustration of the computational value of using conjugate
priors. Loosely speaking, the discrete-time Kalman filter estimates a Gaussian
distribution (or a function) at each discrete point in time. Fortunately, a Gaussian
distribution is fully described by its two sufficient statistics, that is, a mean and a
covariance/variance. If a Gaussian prior is used, as was the case in Kalman’s original
work, then the posterior distribution is also Gaussian, since Gaussians are closed under
multiplication. Therefore Kalman’s algorithm need only update the sufficient statistics
just mentioned. As an aside, the stochastic models studied by Kalman can quickly become
computationally complex if non-Gaussian priors are used. Kalman-like estimation schemes
for non-Gaussian initial conditions have been studied in [BenOf] and [Mak86].
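
The closure property described in this remark can be illustrated with the simplest scalar case: a Gaussian prior on an unknown quantity and one noisy Gaussian observation of it. The Python sketch below is only this single measurement-update step, not Kalman's full algorithm (there is no dynamic model), and the function and variable names are hypothetical.

```python
def gaussian_update(prior_mean, prior_var, observation, obs_var):
    """Posterior mean and variance for a Gaussian prior and Gaussian observation noise."""
    gain = prior_var / (prior_var + obs_var)       # weight given to the observation
    post_mean = prior_mean + gain * (observation - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var


# The posterior is again Gaussian, so only two numbers need updating per observation.
mean, var = 0.0, 1.0
for y in (2.0, 1.5, 1.8):   # invented observations, each with noise variance 1.0
    mean, var = gaussian_update(mean, var, y, 1.0)
    print(round(mean, 4), round(var, 4))
```

Each pass through the loop replaces the prior by the previous posterior, which is possible only because the posterior stays inside the Gaussian family.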
Table A1: Posterior distribution parameter updates for a Beta prior and binomial
likelihood.

Statistic    Prior to observing x                Posterior to observing x
mean         $\frac{a}{a+b}$                     $\frac{a+x}{a+b+n}$
variance     $\frac{ab}{(a+b)^2 (a+b+1)}$        $\frac{(a+x)(b+n-x)}{(a+b+n)^2 (a+b+n+1)}$


Remark A.2 (Example). To give an explicit example of Bayesian estimation with conjugate
priors, we consider an experiment whose outcomes are distributed according to a binomial
probability distribution. Suppose $X \sim \mathrm{Bin}(n, \theta)$, where $n$ is known and
$\theta \in \Theta$ is an unknown parameter. We suppose that $\theta$ is in the interval
(0,1) and is distributed according to a Beta distribution (see Appendix C), so that

\[
\pi(\theta) = \frac{\theta^{a-1} (1-\theta)^{b-1}}{B(a, b)}. \tag{A3}
\]


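Since the binomial likelihood is proportional to $\theta^{x}(1-\theta)^{n-x}$, the
proportionality form of Bayes’ rule gives
$\pi(\theta \mid x) \propto \theta^{a+x-1}(1-\theta)^{b+n-x-1}$. Consequently the form of
$\pi(\theta \mid x)$ in the above example is again a Beta distribution, now with
parameters $a + x$ and $b + n - x$; the resulting updates/recursions for the mean and
variance are listed in Table A1.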
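
The Beta-binomial update can be sketched in a few lines of Python. The function names and the numerical values are hypothetical; the formulas are the mean and variance entries of Table A1.

```python
def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))


def update_beta(a, b, x, n):
    """Posterior Beta parameters after observing x successes in n binomial trials."""
    return a + x, b + n - x


# Invented example: a Beta(2, 2) prior, then 7 successes observed in 10 trials.
a_post, b_post = update_beta(2, 2, x=7, n=10)
print(a_post, b_post, beta_mean(a_post, b_post), beta_variance(a_post, b_post))
```

No integration is performed at any point: moving from prior to posterior is only the two additions in update_beta.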


Remark A.3. One well known example of a probability density that does not have a
conjugate prior is the uniform probability density.


The interested reader can find far more detail on conjugate priors in the well known
classic works [Rob01, RS00, DeG04, BS94].
Appendix B Multinomial Probability Distributions


B.1 Background

In this Appendix we recall some basic elements of multinomial distributions. Technical
details on these distributions may be found in [Wil62] and [RS01].

In most introductory courses on probability theory one inevitably meets the binomial
distribution, which is a probability distribution for a finite sequence of independent
experiments, each of which has a two-state outcome, for example {0, 1}, {H, T}, or
{Success, Fail}. The very next extension to this situation is a trinomial distribution,
modelling experiments with three possible outcomes. Further, multinomial distributions
model a more general version of the situations just described, where each elementary
experiment results in one and only one of $r \geq 2$ possible outcomes. In particular, the
multinomial distribution describes the probabilities of compound events, each consisting
of $N$ basic experiments (elementary events), each of which has $r$ possible outcomes.

For brevity we denote the set of elementary event outcomes by $S = \{1, 2, \ldots, r\}$,
that is, we consider an integer-valued random variable $X$ with the mapping

\[
X : \Omega \to S. \tag{B1}
\]

Consider a compound experiment consisting of $N$ independent repetitions of the RV $X$.
A specific realisation or outcome of this compound experiment is labelled as

\[
\omega = \{i_1, i_2, \ldots, i_N\}. \tag{B2}
\]

Here $i_\ell \in S$ for $\ell \in \{1, 2, \ldots, N\}$. Note that the collection at (B2)
is an ordered $N$-tuple. Writing $p_i$ for the probability that a single experiment
results in the outcome $i \in S$, the probabilities assigned to each $\omega$ are

\[
p(\omega) = p(\omega \mid X_1(\omega) = i_1, X_2(\omega) = i_2, \ldots, X_N(\omega) = i_N)
= p_{i_1} \times p_{i_2} \times \cdots \times p_{i_N}. \tag{B3}
\]

What we would like to do is write down the probability distribution for any given event
$\omega \in \Omega$. This task is straightforward, needing only some simple formulae from
combinatorics.


B.2  Derivation 

Definition B.1 (Count Frequencies). Suppose $\{n_1, n_2, \ldots, n_r\}$ are natural
numbers whose values indicate the number of specific elementary events occurring in a
compound event consisting of $N$ trials. Clearly, the $n_\ell$ are linearly dependent, so

\[
N = \sum_{\ell=1}^{r} n_\ell. \tag{B4}
\]

The numbers $n_\ell$ are not to be confused with the values $X$ might take from the set
$S$; rather, the numbers $n_\ell$ are counts for the occurrence of the event
$X(\omega) = \ell$ in $N$ trials.
Using these count frequencies we can write down the probability for a compound event
having $n_1$ occurrences of the outcome 1, $n_2$ occurrences of the outcome 2, and
similarly up to $n_r$ occurrences of the outcome $r$. Due to independence this quantity is
$p_1^{n_1} \times p_2^{n_2} \times \cdots \times p_r^{n_r}$. However, numerous outcomes
have exactly this probability. For example, suppose we consider a trivial case of $r = 2$
and $N = 3$; then the outcomes containing two occurrences of 1 and one occurrence of 2 are
{1, 1, 2}, {1, 2, 1} and {2, 1, 1}, each having the identical probability
$p_1^2 \times p_2$. We recall the following result from combinatorics.

Lemma B.1. The number of unordered samples of size $r$ from a population of size $N$ is

\[
C(N, r) \triangleq \binom{N}{r} = \frac{N(N-1)(N-2) \cdots (N-r+1)}{r!}. \tag{B5}
\]


Further, for $1 \leq k \leq r$ we write $Z_k(\omega)$ for a random variable whose integer
value indicates the number of times $k \in \{1, 2, \ldots, r\}$ appeared in $N$
independent repetitions of our basic experiment. Clearly, $\forall \omega \in \Omega$,

\[
\sum_{\ell=1}^{r} Z_\ell(\omega) = N. \tag{B6}
\]

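Now we can write down the probability for our compound event of $N$ trials, which is

\[
\begin{aligned}
P(Z_1(\omega) = n_1, \ldots, Z_r(\omega) = n_r)
&= C(N, n_1)\, C(N - n_1, n_2) \cdots C(N - n_1 - \cdots - n_{r-1}, n_r)
   \times p_1^{n_1} \times \cdots \times p_r^{n_r} \\
&= \binom{N}{n_1} \binom{N - n_1}{n_2} \cdots \binom{N - n_1 - n_2 - \cdots - n_{r-1}}{n_r}
   \times p_1^{n_1} \times \cdots \times p_r^{n_r}.
\end{aligned} \tag{B7}
\]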


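Consequently the multinomial probability density for the joint events
$\{Z_1, Z_2, \ldots, Z_r\}$ has the form

\[
p(Z_1 = n_1, \ldots, Z_r = n_r)
= \frac{N!}{n_1!\, n_2! \cdots n_r!}\; p_1^{n_1} \times \cdots \times p_r^{n_r}
= \frac{N!}{\prod_{\ell=1}^{r} n_\ell!} \prod_{\ell=1}^{r} p_\ell^{n_\ell}. \tag{B8}
\]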
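
As a hedged illustration of (B8), the Python sketch below evaluates the multinomial probability directly and checks it against a brute-force sum over every ordered outcome with the same counts. The function names and the example probabilities are hypothetical, and the brute-force check is only practical for very small $N$ and $r$.

```python
from itertools import product
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Probability (B8) of observing the given counts in sum(counts) trials."""
    n_trials = sum(counts)
    coefficient = factorial(n_trials)
    for n_l in counts:
        coefficient //= factorial(n_l)   # exact integer division at every step
    return coefficient * prod(p ** n for p, n in zip(probs, counts))


def brute_force_pmf(counts, probs):
    """Sum the probabilities of every ordered outcome that has these counts."""
    r, n_trials = len(probs), sum(counts)
    total = 0.0
    for outcome in product(range(r), repeat=n_trials):
        if [outcome.count(k) for k in range(r)] == list(counts):
            total += prod(probs[i] for i in outcome)
    return total


# The r = 2, N = 3 case from the text: three orderings, each with probability p1^2 * p2.
print(multinomial_pmf([2, 1], [0.5, 0.5]))
print(multinomial_pmf([2, 1, 1], [0.2, 0.3, 0.5]), brute_force_pmf([2, 1, 1], [0.2, 0.3, 0.5]))
```

The multinomial coefficient is exactly the number of ordered outcomes that the brute-force loop finds, which is the counting argument behind (B7).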


The moment generating function for a multinomial distribution has the form

\[
\mathrm{MGF}(t_1, t_2, \ldots, t_r) = \bigl(p_1 \exp(t_1) + p_2 \exp(t_2) + \cdots + p_r \exp(t_r)\bigr)^{N}. \tag{B9}
\]

Here $t_1, t_2, \ldots, t_r \in \mathbb{R}$.


Remark B.1. For the simple case $r = 2$ the density at (B8) reduces to the binomial
probability density. This is easily seen from the MGF by noting that

\[
\begin{aligned}
\mathrm{MGF}(t_1, 0, 0, \ldots, 0) &= \bigl(p_1 \exp(t_1) + p_2 + \cdots + p_r\bigr)^{N} \\
&= \bigl(1 - p_1 + p_1 \exp(t_1)\bigr)^{N},
\end{aligned} \tag{B10}
\]

which is the MGF for a binomial distribution, and moment generating functions are unique.
B.3 Some Statistics

The following basic statistics are of use: $E[n_i] = N p_i$,
$\mathrm{Var}[n_i] = N p_i (1 - p_i)$ and $\mathrm{Cov}[n_i, n_j] = -N p_i p_j$ for
$i \neq j$. More technical details on multinomial distributions can be found in
[Wil62, RS01]. The multinomial distribution has also been studied through change of
probability-measure techniques. The interested reader is referred to the excellent
monograph [AE04].
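
These statistics can be checked exactly for a small case by enumerating every ordered outcome of the compound experiment. The Python sketch below does this; the function name and the example values of $N$ and $p$ are hypothetical, and the enumeration grows as $r^N$ so it is only a sanity check, not a practical method.

```python
from itertools import product
from math import prod


def count_vector_moments(probs, n_trials):
    """Exact mean vector and covariance matrix of the counts, by full enumeration."""
    r = len(probs)
    mean = [0.0] * r
    second = [[0.0] * r for _ in range(r)]
    for outcome in product(range(r), repeat=n_trials):
        weight = prod(probs[i] for i in outcome)          # probability of this outcome
        counts = [outcome.count(k) for k in range(r)]
        for i in range(r):
            mean[i] += weight * counts[i]
            for j in range(r):
                second[i][j] += weight * counts[i] * counts[j]
    cov = [[second[i][j] - mean[i] * mean[j] for j in range(r)] for i in range(r)]
    return mean, cov


probs, n_trials = [0.2, 0.3, 0.5], 4      # invented example values
mean, cov = count_vector_moments(probs, n_trials)
print(mean)
print(cov[0][0], cov[0][1])
```

The negative off-diagonal covariance reflects the linear dependence at (B4): a larger count for one outcome leaves fewer trials for the others.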
Appendix C The Beta Distribution

C.1 Basic Properties

The Beta probability distribution belongs to a family of distributions whose forms usually
involve Gamma functions. Recall that the Gamma function may be written

\[
\Gamma(x) = \int_0^{\infty} t^{x-1} \exp(-t)\, dt. \tag{C1}
\]

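Here we take $x \in \mathbb{R}^{+}$. More details on this function and complex valued
arguments can be found in [AS70, GR00]. Some useful formulae for this function are:

\begin{align}
\Gamma(x) &= (x-1)!, \quad x \in \mathbb{N}, \tag{C2} \\
\Gamma(x) &= (x-1)\,\Gamma(x-1), \quad x \in \mathbb{R}^{+},\ x > 1, \tag{C3} \\
\Gamma(1/2) &= \sqrt{\pi}, \tag{C4} \\
\Gamma(2x) &= \frac{2^{2x-1}}{\sqrt{\pi}}\, \Gamma(x)\, \Gamma\!\left(x + \tfrac{1}{2}\right). \tag{C5}
\end{align}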
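
The four formulae above are easy to spot-check numerically with the Gamma function in Python's standard math module. The sketch below is only such a check at one arbitrarily chosen argument; the helper name and the test value of x are hypothetical.

```python
import math


def duplication_rhs(x):
    """Right-hand side of the duplication formula (C5)."""
    return 2.0 ** (2.0 * x - 1.0) / math.sqrt(math.pi) * math.gamma(x) * math.gamma(x + 0.5)


x = 2.7  # arbitrary test argument with x > 1
checks = {
    "C2": math.isclose(math.gamma(5), math.factorial(4)),
    "C3": math.isclose(math.gamma(x), (x - 1.0) * math.gamma(x - 1.0)),
    "C4": math.isclose(math.gamma(0.5), math.sqrt(math.pi)),
    "C5": math.isclose(math.gamma(2.0 * x), duplication_rhs(x)),
}
print(checks)
```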


Given the close connection between the Beta distribution and the Dirichlet distribution
(a Dirichlet distribution for dimension 2 collapses to a Beta distribution), we recall
some basic properties of the Beta distribution here.

The term Beta distribution refers to a family of probability distributions defined on the
set (0, 1), indexed by two scalar-valued parameters, $\alpha$ and $\beta$.

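Definition C.1 (Beta). The probability density for the Beta distribution has the form

\[
f(\xi \mid \theta) = \frac{1}{B(\alpha, \beta)}\, \xi^{\alpha-1} (1-\xi)^{\beta-1}. \tag{C6}
\]

Here $\theta = \{\alpha, \beta\}$, the bounds on variables are $0 < \xi < 1$,
$\alpha > 0$ and $\beta > 0$, and $B(\alpha, \beta)$ denotes the so-called Beta function,

\[
B(\alpha, \beta) = \int_{(0,1)} \xi^{\alpha-1} (1-\xi)^{\beta-1}\, d\xi. \tag{C7}
\]

The Beta function above is directly related to the Gamma function through the following
identity,

\[
B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}. \tag{C8}
\]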

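Conveniently all moments of the Beta distribution can be calculated without evaluating
the integrals in either (C6) or (C7). With $n > -\alpha$ we see that

\[
\begin{aligned}
E[\xi^{n}] &= \frac{1}{B(\alpha, \beta)} \int_0^1 \xi^{n}\, \xi^{\alpha-1} (1-\xi)^{\beta-1}\, d\xi \\
&= \frac{1}{B(\alpha, \beta)} \int_0^1 \xi^{(\alpha+n)-1} (1-\xi)^{\beta-1}\, d\xi \\
&= \frac{B(\alpha + n, \beta)}{B(\alpha, \beta)} \\
&= \frac{\Gamma(\alpha + n)\,\Gamma(\alpha + \beta)}{\Gamma(\alpha + \beta + n)\,\Gamma(\alpha)}.
\end{aligned} \tag{C9}
\]

Consequently the mean and variance for a Beta distributed random variable are,
respectively,

\begin{align}
E[\xi] &= \frac{\alpha}{\alpha + \beta}, \tag{C10} \\
\mathrm{Var}[\xi] &= \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}. \tag{C11}
\end{align}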
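
The step from the general moment formula (C9) to the mean (C10) and variance (C11) can be checked numerically. The Python sketch below evaluates (C9) with the standard library Gamma function and compares the first two moments with the closed forms; the function names and the parameter values are hypothetical.

```python
import math


def beta_raw_moment(alpha, beta, n):
    """E[xi^n] for a Beta(alpha, beta) variable, using the Gamma-function form of (C9)."""
    return (math.gamma(alpha + n) * math.gamma(alpha + beta)
            / (math.gamma(alpha + beta + n) * math.gamma(alpha)))


def beta_mean_and_variance(alpha, beta):
    """Mean and variance obtained from the first two raw moments."""
    m1 = beta_raw_moment(alpha, beta, 1)
    m2 = beta_raw_moment(alpha, beta, 2)
    return m1, m2 - m1 ** 2


alpha, beta = 2.0, 3.0   # invented example parameters
mean, variance = beta_mean_and_variance(alpha, beta)
print(mean, alpha / (alpha + beta))
print(variance, alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))
```

For large arguments math.gamma overflows, so a more careful version would work with math.lgamma instead; that refinement is omitted here for brevity.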

Remark C.1. The Beta distribution provides significant modelling flexibility for
scalar-valued random variables defined on the unit interval, that is, by varying the
values of $\alpha$ and $\beta$ one can generate a large variety of shape features in Beta
distributions. For example, if ($\alpha > 1$, $\beta = 1$) then the family of curves is
strictly increasing, or strictly decreasing if ($\alpha = 1$, $\beta > 1$). If
$\alpha = \beta$, symmetric densities are produced. If $\alpha = \beta = 1$ the Beta
distribution degenerates to a uniform probability distribution.

Remark C.2. The Beta distribution is a conjugate prior for a binomial distribution.


C.2 Checking that $\int f(\xi)\, d\xi = 1$

Finally, it is an interesting exercise to show that the probability density given at (C6)
is in fact a valid probability distribution. The parameter ranges given in Definition C.1
ensure the integral is well defined. However, what remains is to show that the Beta
probability density integrates to unity.

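Recall the (real valued) Gamma function defined at (C1). We first consider a product of
these functions,

\[
\begin{aligned}
\Gamma(a)\Gamma(b) &= \int_0^{\infty} \xi^{a-1} \exp(-\xi)\, d\xi \int_0^{\infty} \lambda^{b-1} \exp(-\lambda)\, d\lambda \\
&= \int_0^{\infty}\!\!\int_0^{\infty} \xi^{a-1} \exp(-\xi)\, \lambda^{b-1} \exp(-\lambda)\, d\xi\, d\lambda.
\end{aligned} \tag{C12}
\]

Here $a, b \in \mathbb{R}^{+}$. This double integral may be solved through change of
variable techniques, the first being

\begin{align}
\xi &= r^2 \cos^2(\theta), \tag{C13} \\
\lambda &= r^2 \sin^2(\theta). \tag{C14}
\end{align}

The Jacobian for this transformation is

\[
J = \left| \frac{\partial(\xi, \lambda)}{\partial(r, \theta)} \right| = 4 r^3 \sin(\theta) \cos(\theta). \tag{C15}
\]

Applying this transformation we get

\[
\Gamma(a)\Gamma(b) = 4 \int_0^{\pi/2}\!\!\int_0^{\infty} \cos(\theta)^{2a-1} \sin(\theta)^{2b-1}\, r^{2a+2b-1} \exp(-r^2)\, dr\, d\theta. \tag{C16}
\]

Now consider the radial component of this double integral and make the substitution
$r = \xi^{1/2}$. Then

\[
\begin{aligned}
I &= 2 \int_0^{\infty} r^{2a+2b-1} \exp(-r^2)\, dr \\
&= 2 \int_0^{\infty} \xi^{a+b-\frac{1}{2}} \exp(-\xi)\, \tfrac{1}{2} \xi^{-\frac{1}{2}}\, d\xi \\
&= \int_0^{\infty} \xi^{a+b-1} \exp(-\xi)\, d\xi \\
&= \Gamma(a + b).
\end{aligned} \tag{C17}
\]

Consequently

\[
\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)} = 2 \int_0^{\pi/2} \cos(\theta)^{2a-1} \sin(\theta)^{2b-1}\, d\theta. \tag{C18}
\]

To compute this integral we make the substitution $\cos(\theta) = \sqrt{\xi}$. Then

\begin{align}
\cos(\theta)^{2a-1} &= \xi^{a-\frac{1}{2}}, \tag{C19} \\
\sin(\theta)^{2b-1} &= (1-\xi)^{b-\frac{1}{2}}. \tag{C20}
\end{align}

Using these substitutions, and noting that
$d\theta = -\tfrac{1}{2} \xi^{-1/2} (1-\xi)^{-1/2}\, d\xi$ with $\xi$ running from 1 to 0
as $\theta$ runs from 0 to $\pi/2$, we see that

\[
\begin{aligned}
2 \int_0^{\pi/2} \cos(\theta)^{2a-1} \sin(\theta)^{2b-1}\, d\theta
&= -2 \int_1^0 \xi^{a-\frac{1}{2}} (1-\xi)^{b-\frac{1}{2}}\, \tfrac{1}{2}\, \xi^{-\frac{1}{2}} (1-\xi)^{-\frac{1}{2}}\, d\xi \\
&= \int_0^1 \xi^{a-1} (1-\xi)^{b-1}\, d\xi \\
&= \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.
\end{aligned} \tag{C21}
\]

Hence the integral defining the Beta function at (C7) equals
$\Gamma(a)\Gamma(b)/\Gamma(a+b)$, which is the identity (C8), and so the density (C6)
integrates to unity.

□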
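
The key identity in this argument, that twice the angular integral equals $\Gamma(a)\Gamma(b)/\Gamma(a+b)$, can be sanity-checked numerically. The Python sketch below uses a plain midpoint rule; the function name, the step count and the parameter values are hypothetical, and the simple quadrature is only expected to be accurate when $a$ and $b$ are at least 1/2 so that the integrand is bounded.

```python
import math


def angular_integral(a, b, steps=20000):
    """Midpoint-rule estimate of 2 * integral_0^{pi/2} cos^(2a-1) * sin^(2b-1) d(theta)."""
    h = (math.pi / 2.0) / steps
    total = 0.0
    for k in range(steps):
        theta = (k + 0.5) * h   # midpoint of the k-th sub-interval
        total += math.cos(theta) ** (2.0 * a - 1.0) * math.sin(theta) ** (2.0 * b - 1.0)
    return 2.0 * total * h


def gamma_ratio(a, b):
    """Gamma(a) * Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


for a, b in ((2.0, 3.0), (1.0, 1.0), (1.5, 2.5)):
    print(a, b, angular_integral(a, b), gamma_ratio(a, b))
```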
Appendix D The Dirichlet Probability Distribution

At first meeting, the Dirichlet distribution will seem surprising for two reasons:

1. [Form] Most univariate probability distributions may be written in a general form
$f(x; \theta)$, where $x$ is the independent variable and $\theta$ denotes a set of one or
more parameters. However, the Dirichlet distribution differs from this form in that its
independent variable is itself a discrete probability distribution, and so it is written
(roughly) as $f(\{p_1, p_2, \ldots, p_K\}; \theta)$. Consequently the Dirichlet
distribution is a distribution over finite probability distributions.

2. [Visualisation] The second difficulty one encounters with the Dirichlet distribution
is that it is not easily visualised, especially for higher dimensions. In the case of a
univariate Gaussian distribution (for example), one immediately sees from inspection a
measure of central tendency (mean) and a dispersion about a mean (variance). Other basic
shape features one might look for in a simple probability distribution are skew, kurtosis
and mode. However, none of these shape features are easily seen with a Dirichlet
distribution. At the outset, the domain for a Dirichlet distribution is a standard simplex
(defined below); consequently plotting and visualising Dirichlet distributions will not be
routine.


The Dirichlet distribution is named after the famous German mathematician Johann Peter
Gustav Lejeune Dirichlet (footnote 14) [1805-1859]. Dirichlet’s seminal paper on this
topic (concerning mostly Dirichlet Integrals) appeared in 1839, see [Dir39]. A
comprehensive survey of the history of the Dirichlet distribution, the Dirichlet integral
and the related Liouville distribution can be found in [GR01]. Surprisingly this
distribution is not well known; indeed many modern books on probability theory make no
mention of it at all, with one notable exception being Mathematical Statistics by S. S.
Wilks ([Wil62]). However, the Dirichlet distribution has been widely applied in a
diversity of settings such as statistical genetics, belief functions, order statistics
(see §8.7 in [Wil62]) and reliability theory, to name a few. Further detail on this
distribution and its applications can be found in [Fer73, Ant74, Set94, FKG10, BS94].

The Dirichlet distribution arises quite naturally in LDA text analysis as LDA casts topic
modelling and estimation as a Bayesian inference problem. Consequently, basic parameters
such as the probability distribution of words in a topic and the probability distribution
of topics within a document are not interpreted as fixed and unknown but rather as random
variables, each with probability distributions in their own right. The Dirichlet
distribution offers a convenient means to write down these problems, with the added bonus
that the Dirichlet is a conjugate prior for the multinomial distribution. This makes an
already complex problem tractable. As an aside to this, some authors have studied variants
on the Dirichlet distribution in modelling of “bursts of words” in text, see [MKE05].

14 One indication of the reputation and contributions of Dirichlet is that he was chosen to be C. F. Gauss'
successor at the University of Göttingen.
Definition D.1 (The Standard Simplex). There exists a variety of simplices; for our
purposes the so-called standard simplex is sufficient. We denote the standard simplex
in $\mathbb{R}^K$ by $S_K$, where

$$ S_K = \Big\{ (p_1, p_2, \ldots, p_K) \;\Big|\; p_i \in [0,1] \ \ \forall i = 1, 2, \ldots, K, \ \text{and} \ \sum_{i=1}^{K} p_i = 1 \Big\} \subset \mathbb{R}^K. \tag{D1} $$

In some literature the simplex $S_K$ is referred to as the "probability simplex".

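Definition D.2 (Dirichlet Distribution). Consider a finite probability distribution $p$,
where

$$ p = \{p_1, p_2, \ldots, p_K\}. \tag{D2} $$

The probability distribution $p$ is said to have a Dirichlet distribution if

$$ p \sim \mathrm{Dir}_K(p, \{\alpha_1, \alpha_2, \ldots, \alpha_K\}) = \frac{\Gamma\big(\sum_{k=1}^{K} \alpha_k\big)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} p_k^{\alpha_k - 1} \, I\{p \in S_K\}. \tag{D3} $$

Here the parameter $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_K\}$ is such that

$$ \alpha_k \in \mathbb{R}_+ \quad \forall k = 1, 2, \ldots, K. \tag{D4} $$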

Remark D.1. Note that the most basic and expected parameters of mean and variance
do not appear explicitly in the function given at equation (D3). These parameters are
computed from the alpha values. Further, it is conventional to use the word "concentration",
rather than variance, for Dirichlet probability distributions.

Remark D.2. If K = 2, then the distribution at (D3) reduces to the Beta
distribution. In this sense the Dirichlet may be thought of as a K > 2 generalization of
the Beta distribution.

Remark  D.3.  If  all  components  of  a  are  set  identically  to  unity,  the  Dirichlet  becomes 
uniform  in  the  sense  that  every  candidate  p  is  equally  likely  to  occur. 

Remark D.4. It is an interesting and non-trivial exercise to show that the Dirichlet
distribution integrates to unity on the standard simplex. Details of this calculation can be
found in [Wil62].
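As a small numerical illustration of Definition D.2 and Remarks D.2 and D.3, the sketch below evaluates the density (D3) in plain Python. The function names `dirichlet_pdf` and `beta_pdf`, the tolerance `tol` and the crude handling of boundary points are our own choices for illustration and are not part of the report's software.

```python
import math

def dirichlet_pdf(p, alpha, tol=1e-9):
    """Evaluate the Dirichlet density (D3) at p; returns 0 off the simplex."""
    if len(p) != len(alpha) or any(a <= 0 for a in alpha):
        raise ValueError("need len(p) == len(alpha) and every alpha_k > 0")
    # Indicator I{p in S_K}: entries in [0, 1] that sum to one.
    if any(x < 0 or x > 1 for x in p) or abs(sum(p) - 1.0) > tol:
        return 0.0
    # log of the normalising constant Gamma(alpha_0) / prod_k Gamma(alpha_k)
    log_density = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    for x, a in zip(p, alpha):
        if x == 0.0:
            # Boundary points are handled crudely (assumption of this sketch).
            if a == 1.0:
                continue
            return 0.0 if a > 1.0 else math.inf
        log_density += (a - 1.0) * math.log(x)
    return math.exp(log_density)

def beta_pdf(y, a, b):
    """Beta(a, b) density at y in (0, 1), for comparison with the K = 2 case."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(y) + (b - 1) * math.log(1 - y))

# Remark D.2: with K = 2 the Dirichlet density agrees with the Beta density.
print(dirichlet_pdf([0.3, 0.7], [2.0, 5.0]), beta_pdf(0.3, 2.0, 5.0))
# Remark D.3: with alpha = (1, 1, 1) the density is constant on the simplex.
print(dirichlet_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))
```

Working with logarithms of Gamma functions avoids overflow when the alpha values are large, which is the usual situation in LDA.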


D.1 Basic Statistics For A Dirichlet Distribution


Given  that  the  Dirichlet  distribution’s  independent  variable  is  multivariate,  its  statistics 
may  be  given  component  wise.  These  statistics  may  be  computed  from  the  characteristic 
function  (see  Wilks  [Wil62])  or  directly  using  properties  of  Gamma  functions.  We  show  one 
such  calculation  here  for  the  means  of  a  Dirichlet  distribution.  It  is  sufficient  to  consider 
one  component  of  the  mean  as  is  shown  below. 


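$$
\begin{aligned}
E[p_i] &= \int_{S_K} p_i \, \mathrm{Dir}_K(p, \alpha) \, dp \\
&= \int \cdots \int p_i \, \mathrm{Dir}_K(p, \alpha) \, dp_1 \, dp_2 \ldots dp_K \\
&= \int_{S_K} p_i \, \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_2) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} p_k^{\alpha_k - 1} \, dp.
\end{aligned} \tag{D5}
$$

Here we write $\alpha_0 = \sum_{k=1}^{K} \alpha_k$. Without loss of generality we may take $p_i = p_1$, so that

$$ E[p_1] = \int_{S_K} \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_2) \cdots \Gamma(\alpha_K)} \, p_1^{\alpha_1} \prod_{k=2}^{K} p_k^{\alpha_k - 1} \, dp. \tag{D6} $$

Recalling the properties of Gamma functions for $\alpha_i \in \mathbb{R}_+$ (in particular $\Gamma(x + 1) = x\,\Gamma(x)$), we make the following
assignments

$$ \beta_1 = \alpha_1 + 1, \tag{D7} $$

$$ \beta_i = \alpha_i \quad \forall i = 2, 3, \ldots, K, \tag{D8} $$

and write $\beta_0 = \sum_{k=1}^{K} \beta_k = \alpha_0 + 1$, so that $\Gamma(\alpha_0)/\Gamma(\alpha_1) = (\alpha_1/\alpha_0)\,\Gamma(\beta_0)/\Gamma(\beta_1)$. Consequently

$$
\begin{aligned}
E[p_1] &= \int_{S_K} \frac{\alpha_1}{\alpha_0} \, \frac{\Gamma(\beta_0)}{\Gamma(\beta_1)\Gamma(\beta_2) \cdots \Gamma(\beta_K)} \, p_1^{\beta_1 - 1} \prod_{k=2}^{K} p_k^{\beta_k - 1} \, dp \\
&= \frac{\alpha_1}{\alpha_0} \int_{S_K} \mathrm{Dir}_K(p, \beta) \, dp \\
&= \frac{\alpha_1}{\alpha_0}.
\end{aligned} \tag{D9}
$$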


Similar  calculations  may  be  used  to  compute  other  useful  statistics  which  we  list  in  the 
table  below. 


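Table D1: Some basic statistics for a Dirichlet distribution

Statistic                          Value

$E[p_i]$                           $\dfrac{\alpha_i}{\alpha_0}$

$E[p]$                             $\Big( \dfrac{\alpha_1}{\alpha_0}, \dfrac{\alpha_2}{\alpha_0}, \ldots, \dfrac{\alpha_K}{\alpha_0} \Big)$

$\mathrm{Var}(p_i)$                $\dfrac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^2 (1 + \alpha_0)}$

$\mathrm{Cov}(p_i, p_j),\ i \neq j$   $\dfrac{-\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0 + 1)}$

$\mathrm{Mode}(p)$                 $\Big( \dfrac{\alpha_1 - 1}{\alpha_0 - K}, \dfrac{\alpha_2 - 1}{\alpha_0 - K}, \ldots, \dfrac{\alpha_K - 1}{\alpha_0 - K} \Big)$, for $\alpha_k > 1$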
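The entries of Table D1 are simple to compute once the alpha values are known. The sketch below is a minimal illustration, assuming the standard closed forms for the Dirichlet mean, covariance matrix and mode; the function name `dirichlet_stats` is hypothetical. A useful sanity check is that each row of the covariance matrix sums to zero, which must hold because the components of p always sum to one.

```python
def dirichlet_stats(alpha):
    """Return (mean, covariance matrix, mode) of Dir(alpha) from the closed forms."""
    a0 = sum(alpha)
    K = len(alpha)
    mean = [a / a0 for a in alpha]
    denom = a0 * a0 * (a0 + 1.0)
    # Diagonal entries are the variances, off-diagonal entries the covariances.
    cov = [[(alpha[i] * (a0 - alpha[i]) if i == j else -alpha[i] * alpha[j]) / denom
            for j in range(K)]
           for i in range(K)]
    # The interior mode only exists when every alpha_k exceeds one.
    mode = [(a - 1.0) / (a0 - K) for a in alpha] if all(a > 1 for a in alpha) else None
    return mean, cov, mode

mean, cov, mode = dirichlet_stats([2.0, 3.0, 5.0])
print("mean:", mean)
print("variances:", [cov[i][i] for i in range(3)])
print("row sums of covariance:", [sum(row) for row in cov])
print("mode:", mode)
```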

D.2  Conjugacy  with  a  Multinomial  Distribution 

The following definition is standard in Bayesian statistics.

Definition D.3 (Exponential Families). Suppose $\mu(\cdot)$ is a $\sigma$-finite measure defined on
$\mathcal{X}$ and that $\Theta$ denotes a parameter space. Suppose that the functions $C(\cdot)$ and $f(\cdot)$ are,
respectively, from $\Theta$ and $\mathcal{X}$ to $\mathbb{R}_+$. Further, the functions $R(\cdot)$ and $T(\cdot)$ map, respectively, from $\Theta$ and
$\mathcal{X}$ to $\mathbb{R}^K$. The family of probability distributions with respect to the measure $\mu(\cdot)$

$$ f(x \mid \theta) = C(\theta) f(x) \exp\big( \langle R(\theta), T(x) \rangle \big) \tag{D10} $$

is called an exponential family of dimension $K$.
For more details on exponential families see [Rob01].

The Dirichlet distribution may be written as follows,

$$ \mathrm{Dir}_K(p, \alpha) = C(\alpha) f(p) \exp\Big( \sum_{k=1}^{K} \alpha_k \ln(p_k) \Big). \tag{D11} $$

Comparing with (D3), $C(\alpha) = \Gamma\big(\sum_{k} \alpha_k\big) / \prod_{k} \Gamma(\alpha_k)$ and $f(p) = I\{p \in S_K\} \prod_{k} p_k^{-1}$. Here the operator $T(\cdot)$, as it appears in equation (D10), is explicitly

$$ T(p) = \big( \ln(p_1), \ln(p_2), \ldots, \ln(p_K) \big). \tag{D12} $$

Consequently Dirichlet distributions constitute an exponential family for the operator $T(\cdot)$,
as defined by equation (D12), and so have a conjugate prior due to the Pitman-Koopman
Lemma (see [Rob01]).

Lemma D.1. Suppose the probability distribution $p = \{p_1, p_2, \ldots, p_K\}$ is a random prior
in a Bayesian estimation task and is assumed to be distributed according to a Dirichlet
distribution with parameter $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_K\}$. Further, suppose the set of integers
$\{m_1, m_2, \ldots, m_K\}$ is distributed according to a multinomial distribution with the same
probabilities $p$ as above and is the assumed likelihood for this Bayesian estimation task. The
unnormalised posterior distribution, formed through the product of these two distributions,
is an unnormalised Dirichlet distribution.


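Proof of Lemma D.1. Write $N = \sum_{k=1}^{K} m_k$. Then

$$
\begin{aligned}
\pi(p \mid \{m_1, m_2, \ldots, m_K\}) &\propto \mathcal{L}(\{m_1, m_2, \ldots, m_K\} \mid p) \, \pi(p) \\
&= \Big( \frac{N!}{m_1! \, m_2! \cdots m_K!} \prod_{k=1}^{K} p_k^{m_k} \Big) \Big( \frac{\Gamma\big(\sum_{k=1}^{K} \alpha_k\big)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} p_k^{\alpha_k - 1} \Big) \\
&= \Big( \frac{N!}{m_1! \, m_2! \cdots m_K!} \Big) \Big( \frac{\Gamma\big(\sum_{k=1}^{K} \alpha_k\big)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \Big) \prod_{k=1}^{K} p_k^{\alpha_k + m_k - 1} \\
&\propto \mathrm{Dir}_K(p, \{\alpha_1 + m_1, \alpha_2 + m_2, \ldots, \alpha_K + m_K\}). \qquad \square
\end{aligned} \tag{D13}
$$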
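In computational terms, Lemma D.1 says that the Bayesian update reduces to adding the observed counts to the prior parameters. The sketch below is a minimal illustration of this update; the function names `dirichlet_posterior` and `posterior_mean` and the example counts are hypothetical.

```python
def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters alpha_k + m_k, as in (D13)."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + m for a, m in zip(alpha, counts)]

def posterior_mean(alpha, counts):
    """Mean of the posterior Dirichlet, (alpha_k + m_k) / (alpha_0 + N)."""
    post = dirichlet_posterior(alpha, counts)
    total = sum(post)
    return [a / total for a in post]

prior = [1.0, 1.0, 1.0]          # uniform prior on the simplex (Remark D.3)
counts = [3, 0, 7]               # hypothetical multinomial counts, N = 10
print(dirichlet_posterior(prior, counts))
print(posterior_mean(prior, counts))
```

Note that the posterior mean lies between the prior mean (1/3 for each component here) and the observed relative frequencies, with the prior acting like alpha_0 additional pseudo-observations.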


D.3  Generating  Dirichlet  Random  Variables 

There are several schemes by which one might generate samples from a Dirichlet
distribution and indeed several physical models which give rise to a Dirichlet distribution, for
example, the Pólya urn scheme and the so-called unity-length "stick breaking" scheme,
see [FKG10]. Perhaps the most popular and convenient scheme to generate samples from
a Dirichlet distribution is via the simulation of independently distributed Gamma random
variables, each having a common scale parameter.

Definition D.4 (The Gamma Distribution). Gamma random variables are continuously
distributed and take values in the positive half of the real line $\mathbb{R}_+$, according to the
probability density function

$$ f(x \mid k, \theta) = \frac{1}{\Gamma(k)\,\theta^{k}} \, x^{k-1} \exp(-x/\theta). \tag{D14} $$

Here $k$ and $\theta$ are, respectively, the so-called shape and scale parameters. These parameters
take values in the positive half of the real line.


The  computer  simulation  of  gamma  random  variables  is  discussed  in  the  articles  [AD74, 
AP76]. 


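Lemma D.2. Suppose $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_K\}$ is a given parameter for the Dirichlet
distribution. Further, suppose that for $i = 1, 2, \ldots, K$, the random variables $\gamma_i \sim \Gamma(\alpha_i, 1)$ are independent, each with shape $\alpha_i$ and unit scale.

Write

$$ p = \{p_1, p_2, \ldots, p_K\} = \Big\{ \frac{\gamma_1}{\sum_{j=1}^{K} \gamma_j}, \frac{\gamma_2}{\sum_{j=1}^{K} \gamma_j}, \ldots, \frac{\gamma_K}{\sum_{j=1}^{K} \gamma_j} \Big\}. \tag{D15} $$

Then

$$ p \sim \mathrm{Dir}_K(\alpha). \tag{D16} $$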


Proof of Lemma D.2.

We suppose that the random variables $\gamma_1, \gamma_2, \ldots, \gamma_{K+1}$ are independently sampled from
Gamma distributions with shape parameters (respectively) $\alpha_1, \alpha_2, \ldots, \alpha_{K+1}$. Here each
distribution also has a common scale parameter $\theta = 1$. The joint probability density for
this collection of independent Gamma variables is,

$$ \varphi(\gamma_1, \gamma_2, \ldots, \gamma_{K+1}) = \prod_{j=1}^{K+1} \frac{1}{\Gamma(\alpha_j)} \, \gamma_j^{\alpha_j - 1} \exp(-\gamma_j). \tag{D17} $$

Recall that $0 < \gamma_j < \infty$.

New random variables $u_1, u_2, \ldots, u_K, u_{K+1}$ are now defined by

$$ u_i = \frac{\gamma_i}{\gamma_1 + \gamma_2 + \cdots + \gamma_{K+1}}, \quad i \in \{1, 2, \ldots, K\}, \tag{D18} $$

$$ u_{K+1} = \gamma_1 + \gamma_2 + \cdots + \gamma_{K+1}. \tag{D19} $$

What we would like to do is compute the joint probability density for the random variables
$u_1, u_2, \ldots, u_K, u_{K+1}$, noting that $u_{K+1}$ is a Gamma random variable with shape parameter
$\sum_{j=1}^{K+1} \alpha_j$. This calculation may be carried out by using the Jacobian of the transformation
and the following inverse functions:

$$ \gamma_1 = u_1 u_{K+1}, \tag{D20} $$

$$ \gamma_2 = u_2 u_{K+1}, \tag{D21} $$

$$ \vdots $$

$$ \gamma_{K+1} = u_{K+1} (1 - u_1 - u_2 - \cdots - u_K). \tag{D22} $$
The corresponding Jacobian has the form,

$$
J(u_1, u_2, \ldots, u_{K+1}, \gamma_1, \gamma_2, \ldots, \gamma_{K+1}) =
\begin{vmatrix}
u_{K+1} & 0 & \cdots & 0 & u_1 \\
0 & u_{K+1} & \cdots & 0 & u_2 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & u_{K+1} & u_K \\
-u_{K+1} & -u_{K+1} & \cdots & -u_{K+1} & (1 - u_1 - u_2 - \cdots - u_K)
\end{vmatrix}
= (u_{K+1})^{K}. \tag{D23}
$$

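Writing $f(\cdot)$ for the joint probability density function of $u_1, u_2, \ldots, u_K, u_{K+1}$, we see that

$$
\begin{aligned}
f(u_1, u_2, \ldots, u_K, u_{K+1}) &= \varphi\big( u_1 u_{K+1}, \ldots, u_K u_{K+1}, \, u_{K+1}(1 - u_1 - u_2 - \cdots - u_K) \big) \, |J| \\
&= \Big[ \prod_{j=1}^{K} \frac{1}{\Gamma(\alpha_j)} (u_j u_{K+1})^{\alpha_j - 1} \exp(-u_j u_{K+1}) \Big] \\
&\quad \times \frac{1}{\Gamma(\alpha_{K+1})} \big( u_{K+1}(1 - u_1 - u_2 - \cdots - u_K) \big)^{\alpha_{K+1} - 1} \exp\big( -u_{K+1}(1 - u_1 - \cdots - u_K) \big) \, |J| \\
&= \Big[ \prod_{j=1}^{K} \frac{u_j^{\alpha_j - 1}}{\Gamma(\alpha_j)} \Big] \frac{1}{\Gamma(\alpha_{K+1})} (1 - u_1 - u_2 - \cdots - u_K)^{\alpha_{K+1} - 1} \, u_{K+1}^{\left( \sum_{j=1}^{K+1} \alpha_j \right) - 1} \exp(-u_{K+1}).
\end{aligned} \tag{D24}
$$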

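To complete our calculations we marginalise out the (independent) Gamma random
variable $u_{K+1}$, that is,

$$
\begin{aligned}
g(u_1, u_2, \ldots, u_K) &\triangleq \int_0^{\infty} f(u_1, u_2, \ldots, u_K, \xi) \, d\xi \\
&= \Big[ \prod_{j=1}^{K} \frac{u_j^{\alpha_j - 1}}{\Gamma(\alpha_j)} \Big] \frac{\Gamma\big( \sum_{j=1}^{K+1} \alpha_j \big)}{\Gamma(\alpha_{K+1})} (1 - u_1 - u_2 - \cdots - u_K)^{\alpha_{K+1} - 1} \\
&\quad \times \int_0^{\infty} \frac{1}{\Gamma\big( \sum_{j=1}^{K+1} \alpha_j \big)} \, \xi^{\left( \sum_{j=1}^{K+1} \alpha_j \right) - 1} \exp(-\xi) \, d\xi.
\end{aligned} \tag{D25}
$$

The last integral directly above is a Gamma probability density integrated over its entire
domain and so is unity. Consequently

$$ g(u_1, u_2, \ldots, u_K) = \frac{\Gamma(\alpha_1 + \alpha_2 + \cdots + \alpha_{K+1})}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_{K+1})} \, u_1^{\alpha_1 - 1} \cdots u_K^{\alpha_K - 1} (1 - u_1 - \cdots - u_K)^{\alpha_{K+1} - 1}, \tag{D26} $$

which is the Dirichlet density in $K + 1$ components, the last component being $1 - u_1 - \cdots - u_K$. $\square$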
Remark D.5. The above technique to simulate Dirichlet random variables via the
sampling of a Gamma distribution is commonly used. Moreover, since this is an exact
simulation technique (i.e. equality in distribution), this result can be used to establish some
interesting properties of Dirichlet random variables, such as the aggregation property established
in the next section. There are alternative techniques to simulate Dirichlet random
variables; for example, five different simulation techniques are presented in [Nar90]. The
simulation of Dirichlet random variables is also discussed in [HC70], [Dev86] and [Ken09].
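The construction in Lemma D.2 can be sketched in a few lines of Python using only the standard library. In `random.gammavariate(alpha, beta)` the first argument is the shape and the second the scale. The function name `sample_dirichlet`, the seed and the example parameter values are our own choices for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one p ~ Dir(alpha) by normalising independent Gamma(alpha_i, 1) draws (D15)."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]   # shape a, unit scale
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(1)                    # fixed seed so the run is repeatable
alpha = [2.0, 3.0, 5.0]
draws = [sample_dirichlet(alpha, rng) for _ in range(20_000)]

# Each draw lies on the simplex, and the sample means should be close to
# alpha_i / alpha_0 = (0.2, 0.3, 0.5), as given by (D9).
sample_means = [sum(d[i] for d in draws) / len(draws) for i in range(len(alpha))]
print(sample_means)
```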


D.4  Aggregation  Property 

The so-called aggregation property of the Dirichlet distribution is particularly important
in Latent Dirichlet Allocation when applied to topic modelling in text analysis. Perhaps
the most popular approach to derive this property is to make use of an additive
property for Gamma random variables with a common scale parameter. This
property, and the generation of Dirichlet random variables via Gamma random variables, may
be combined to establish the aggregation property of Dirichlet random variables.

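Suppose $X$ and $Y$ are independent Gamma random variables with shape parameters $k_X$ and $k_Y$. Further,
suppose the scale parameters of these random variables are $\theta_X$ and $\theta_Y$. If $\theta_X = \theta_Y = \theta$, then
the following equality holds (in distribution), where $\mathrm{Gam}(k, \theta)$ denotes a Gamma distribution with shape $k$ and scale $\theta$,

$$ \mathrm{Gam}(k_X + k_Y, \theta) \stackrel{d}{=} \mathrm{Gam}(k_X, \theta) + \mathrm{Gam}(k_Y, \theta). \tag{D27} $$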
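The additive property (D27) is easy to check by simulation. The sketch below is a minimal illustration using the parameter values quoted for Figure D1, read as (shape, scale) pairs; the function name `simulate_gamma_additivity` and the seed are hypothetical, so the numbers produced will not reproduce the report's estimates digit for digit.

```python
import random
from statistics import fmean, pvariance

def simulate_gamma_additivity(n=10_000, seed=0, scale=2.5):
    """Compare Z ~ Gam(5, scale) with X + Y, X ~ Gam(1.7, scale), Y ~ Gam(3.3, scale)."""
    rng = random.Random(seed)
    z = [rng.gammavariate(5.0, scale) for _ in range(n)]
    x_plus_y = [rng.gammavariate(1.7, scale) + rng.gammavariate(3.3, scale)
                for _ in range(n)]
    return (fmean(z), pvariance(z)), (fmean(x_plus_y), pvariance(x_plus_y))

(z_mean, z_var), (s_mean, s_var) = simulate_gamma_additivity()
# Exact values for shape 5 and scale 2.5: mean = 5 * 2.5 = 12.5,
# variance = 5 * 2.5**2 = 31.25.
print("Z:     mean", z_mean, "variance", z_var)
print("X + Y: mean", s_mean, "variance", s_var)
```

Agreement of the first two moments is of course only a necessary condition for equality in distribution; comparing histograms or empirical distribution functions gives a fuller check.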

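A simulation example of this property is shown in Figure D1. In this example Z ~
Gam(5, 2.5), X ~ Gam(1.7, 2.5) and Y ~ Gam(3.3, 2.5). The resulting estimated statistics
were: $E[Z] = 12.5262$, $E[(Z - E[Z])^2] = 31.2424$, $E[X + Y] = 12.4906$ and
$E[((X + Y) - E[X + Y])^2] = 31.2135$ (the exact mean and variance are 12.5 and 31.25). The histograms in Figure D1 were generated from i.i.d. sets of
10,000 samples each.

Suppose that $\gamma_i \sim \mathrm{Gam}(\alpha_i, 1)$, where $\alpha_i > 0$ for $i \in \{1, 2, \ldots, K\}$.
Write $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_K)$ and

$$ \mathbf{1} = (1, 1, \ldots, 1) \in \mathbb{R}^K. \tag{D28} $$

It was shown above in §D.3 that

$$ p = \Big( \frac{\gamma_1}{\langle \gamma, \mathbf{1} \rangle}, \frac{\gamma_2}{\langle \gamma, \mathbf{1} \rangle}, \ldots, \frac{\gamma_K}{\langle \gamma, \mathbf{1} \rangle} \Big) \tag{D29} $$

$$ \sim \mathrm{Dir}_K(p, \alpha). \tag{D30} $$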


Suppose the index set $\{1, 2, \ldots, K\}$ is decomposed into a union of mutually disjoint sets
as follows,

$$ \{1, 2, \ldots, K\} = I_1 \cup I_2 \cup \cdots \cup I_M, \qquad I_i \cap I_j = \emptyset, \ \forall i \neq j. \tag{D31} $$

We now consider aggregated "states" formed through various additions of the probabilities
$p_i$, for example,

$$ \bar{p}_j = \sum_{i \in I_j} p_i. \tag{D32} $$
Figure D1: Simulated example of the shape-parameter additive property for gamma
distributed random variables. (a) Z = X + Y, where the two independent random variables
X and Y are distributed as X ~ Gamma(1.7, 2.5) and Y ~ Gamma(3.3, 2.5). (b) Z ~ Gamma(5, 2.5).

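Then

$$
\begin{aligned}
\{\bar{p}_1, \bar{p}_2, \ldots, \bar{p}_M\} &= \Big( \sum_{i \in I_1} p_i, \sum_{i \in I_2} p_i, \ldots, \sum_{i \in I_M} p_i \Big) \\
&\stackrel{d}{=} \Big( \frac{\sum_{i \in I_1} \gamma_i}{\langle \gamma, \mathbf{1} \rangle}, \frac{\sum_{i \in I_2} \gamma_i}{\langle \gamma, \mathbf{1} \rangle}, \ldots, \frac{\sum_{i \in I_M} \gamma_i}{\langle \gamma, \mathbf{1} \rangle} \Big) \\
&\stackrel{d}{=} \mathrm{Dir}_M\Big( \{\bar{p}_1, \bar{p}_2, \ldots, \bar{p}_M\}, \Big\{ \sum_{i \in I_1} \alpha_i, \sum_{i \in I_2} \alpha_i, \ldots, \sum_{i \in I_M} \alpha_i \Big\} \Big).
\end{aligned} \tag{D33}
$$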
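The aggregation property can also be examined numerically. In the sketch below, a four-component Dirichlet vector is aggregated into two groups, so the first aggregated component should behave like a Beta variable whose parameters are the summed alpha values. The parameter values, the grouping, the seed and the function names `sample_dirichlet` and `aggregate` are illustrative assumptions.

```python
import random
from statistics import fmean, pvariance

def sample_dirichlet(alpha, rng):
    """Draw p ~ Dir(alpha) by normalising independent Gamma(alpha_i, 1) variables."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def aggregate(p, groups):
    """Sum the probabilities p_i over each index set I_j, as in (D32)."""
    return [sum(p[i] for i in group) for group in groups]

alpha = [1.0, 2.0, 3.0, 4.0]
groups = [(0, 1), (2, 3)]        # I_1 = {1, 2} and I_2 = {3, 4}, using 0-based indices
rng = random.Random(0)
first = [aggregate(sample_dirichlet(alpha, rng), groups)[0] for _ in range(20_000)]

# Aggregated parameters are (1 + 2, 3 + 4) = (3, 7), so by Table D1 the first
# aggregated component has mean 3/10 and variance 3 * 7 / (10**2 * 11).
print("mean:    ", fmean(first), "expected", 3 / 10)
print("variance:", pvariance(first), "expected", 21 / 1100)
```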
Appendix E Gibbs Sampling

E.1 Background

Gibbs Sampling is a very well known numerical method in statistics and is one of the Markov
Chain Monte Carlo (MCMC) methods. In particular, this method provides (in some cases)
a means to sample joint probability distributions of N dimensions by instead sampling
N univariate conditional distributions. There are many excellent books
covering MCMC schemes, which have in recent years become enormously popular in areas
such as filtering and quantitative finance. The interested reader is referred to [Gam97]
for an introductory level treatment. More technical details on Gibbs sampling may be
found in [Asm10], [GRS96] and [RC05]. The statistical literature also contains numerous
expository articles on Gibbs Sampling schemes and theory, for example [CLR01, CG92,
CG95, SG92, GS90, BGHM95, Gey92].

The seminal ideas of Gibbs sampling were introduced in the article [GG84]. However, an
equally important article written by Gelfand and Smith demonstrated a much broader
class of problems that could be solved through Gibbs Sampling, see [GS90].

The theory of this scheme is based primarily upon asymptotic properties of Markov
chains; consequently Gibbs sampling is regarded as a Markov Chain Monte
Carlo (MCMC) method. The relevant Markov chain theory in support of Gibbs sampling is
discussed in [Nor98] and [H04].


E.2  Basic  Markov  Chain  Monte  Carlo  (MCMC) 

Monte  Carlo  estimation  in  general  concerns  some  form  of  approximation  achieved  through 
simulation.  In  MCMC,  the  simulation  device  used  for  the  approximation  is  one  or  more 
Markov  chains.  These  chains  are  of  course  not  arbitrary  and  must  have  certain  properties 
for  Gibbs  Sampling  (and  indeed  other  schemes)  to  achieve  the  desired  results.  Three 
fundamental  Markov  chain  state-properties  needed  for  Gibbs  sampling  are  1)  irreducibility, 
2)  recurrence,  and  3)  aperiodicity. 

Definition E.1 (Irreducibility of Markov Chains). Consider a finite Markov chain
with state space $S = \{s_1, s_2, \ldots, s_m\}$. Roughly speaking, "irreducibility" means
any state in $S$ can be reached from any other state in $S$. To make this precise the
state-classifying notion of "communicating" states is used, that is, $s_i$ communicates with $s_j$
(written $s_i \to s_j$) if the chain has positive probability of ever reaching $s_j$ when it starts
in state $s_i$. Communicating states may also be characterised by stating there exists a natural
number $N$, such that,

$$ \mathrm{Prob}(X_{m+N} = s_j \mid X_m = s_i) > 0. \tag{E1} $$

Note that for a homogeneous Markov chain this probability is independent of the time index $m$. If $s_i \to s_j$
and $s_j \to s_i$, then the states $s_i$ and $s_j$ are said to intercommunicate; this is written as

$$ s_i \leftrightarrow s_j. $$

Intercommunication of states can be shown to be an equivalence relation (see [Ros96]).

A Markov chain with state space $S$ is said to be irreducible if for all pairs of states $s_i, s_j \in S$
we have $s_i \leftrightarrow s_j$.


UNCLASSIFIED 


DSTO-TR-2797

Remark E.1. If a Markov chain is not irreducible it is said to be reducible. This means the
long term behaviour of such a chain may be analysed by one or more smaller state space
Markov chains, hence the term "reducible".

Definition E.2 (Recurrence). A Markov chain $X$ is said to be recurrent if the following
event is certain for all states $s_i$,

$$ \mathrm{prob}(X_M = s_i \ \text{for infinitely many} \ M) = 1. \tag{E2} $$

Definition E.3 (Aperiodicity). A definition of aperiodicity first requires basic notions
of divisibility of integers. A common divisor of two integers $a$ and $b$ is a third integer $d$
such that $d \mid a$ and $d \mid b$. The greatest common divisor (g.c.d.) of two non-zero integers $a$
and $b$ is the largest $d$ such that both $d \mid a$ and $d \mid b$. The g.c.d. is sometimes referred to as
the highest common factor (h.c.f.). Clearly the g.c.d. can be extended to a collection of
more than two integers.

The period of a state $s_i$ of a Markov chain is defined by

$$ t_i = \mathrm{g.c.d.}\{ M \in \mathbb{N} \mid M \geq 1, \ p_{ii}^{(M)} > 0 \}. \tag{E3} $$

Here

$$ p_{ii}^{(M)} = \mathrm{prob}\{ X_{n+M} = s_i \mid X_n = s_i \}. \tag{E4} $$

If $t_i = 1$ for all states in a Markov chain, then that chain is said to be aperiodic.

Remark E.2. Loosely speaking, the period $t_i$ of a state $s_i$ is the g.c.d. of the set of
times (discrete indexes representing time) at which the chain in question can return to $s_i$ given it
started in $s_i$.
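For a finite chain, Definitions E.1 and E.3 can be checked mechanically from the transition matrix. The sketch below is a minimal illustration, assuming the chain is given as a list of rows of transition probabilities; the function names and the example matrices are hypothetical. The period is computed directly from (E3) by scanning return times up to 3n for a chain with n states, a bound we assume is adequate for finite chains since every closed walk decomposes into simple cycles of length at most n.

```python
from math import gcd

def reachable(P, i):
    """Set of states reachable from state i in one or more steps."""
    seen, stack = set(), [i]
    while stack:
        s = stack.pop()
        for t, prob in enumerate(P[s]):
            if prob > 0 and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def is_irreducible(P):
    """True if every state can be reached from every state (Definition E.1)."""
    n = len(P)
    return all(len(reachable(P, i)) == n for i in range(n))

def period(P, i):
    """g.c.d. of the return times M with p_ii^(M) > 0, as in (E3); 0 if no return."""
    n = len(P)
    occupied = [c == i for c in range(n)]     # support of the chain after 0 steps
    g = 0
    for M in range(1, 3 * n + 1):
        occupied = [any(occupied[r] and P[r][c] > 0 for r in range(n))
                    for c in range(n)]
        if occupied[i]:
            g = gcd(g, M)
    return g

flip = [[0.0, 1.0], [1.0, 0.0]]               # deterministic alternation
lazy = [[0.5, 0.5], [1.0, 0.0]]               # state 0 may stay where it is
absorbing = [[1.0, 0.0], [0.5, 0.5]]          # state 0 can never be left

print(is_irreducible(flip), period(flip, 0))
print(is_irreducible(lazy), period(lazy, 0))
print(is_irreducible(absorbing))
```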

Indeed much more can be said about the definitions above; the interested reader is
referred to [Bre99].

The main consequence of a Markov chain having these three properties is the
existence of a stationary distribution (other important consequences are that the strong law of large
numbers will hold and that the chain will be ergodic).
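For a finite chain with transition matrix P, the stationary distribution is the probability vector that is left unchanged by one step of the chain. When the three properties above hold, it can be approximated by repeatedly applying P to any starting distribution. The sketch below illustrates this with a hypothetical two-state chain; the function name `stationary_distribution` and the iteration count are our own choices.

```python
def stationary_distribution(P, iterations=1000):
    """Approximate the stationary distribution by repeated application of P."""
    n = len(P)
    pi = [1.0 / n] * n                        # arbitrary starting distribution
    for _ in range(iterations):
        # One step of the chain: pi_new[c] = sum_r pi[r] * P[r][c]
        pi = [sum(pi[r] * P[r][c] for r in range(n)) for c in range(n)]
    return pi

P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary_distribution(P)
# Solving pi = pi P by hand gives pi = (5/6, 1/6) for this chain.
print(pi)
```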

Consequently, the basic idea of Gibbs Sampling is to identify a Markov chain whose
stationary distribution is the distribution of interest, or a good approximation to it. Then
one samples this Markov chain in an appropriate way to generate samples from the
original target distribution, from which all statistics of interest can be estimated. In LDA
this approximation substantially reduces the dimension and complexity of the estimation
tasks given the large dimensions usually arising in LDA topic modelling problems.

Remark E.3. It should be noted that the above definitions consider Markov chains on a
finite state space; however, Gibbs sampling is not limited to such Markov chains and can
be extended to far more general state spaces. This however raises natural questions about
the extensions of the above definitions and the existence of a corresponding stationary
distribution in such settings. The details of this extension for Gibbs sampling are not
trivial and must consider more delicate definitions, such as a transition probability kernel
rather than a transition matrix. For example, suppose we write the transition kernel from
the vector $x$ to the vector $y$ (both in $\mathbb{R}^d$) as $Q(x, y)$. Then for $A \subset \mathbb{R}^d$,

$$ \mathrm{prob}(Y \in A \mid X = x) = \int_A Q(x, y) \, dy. \tag{E5} $$

Now the analogue of solving the balance equations for a Markov chain is finding a solution
to an integral equation, namely

$$ \pi(y) = \int Q(x, y) \, \pi(x) \, dx. \tag{E6} $$

If a solution to (E6) can be found then $\pi(y)$ is the corresponding stationary distribution.
These details can be found in [MT93], [CMR05] and [RC05].


E.3  Example 

Finally, to fix some basic ideas of Gibbs sampling we recall a commonly studied example
concerning a bivariate distribution. In some sense this example is vacuous given it concerns
a bivariate density and so can be directly addressed; however, this example does offer a
clear and convincing illustration of the Gibbs sampling technique and it is trivial to encode
in a digital computer.

In this example we consider a bivariate probability density where one variable is
integer-valued with finite range and the other variable is continuously-valued on the compact set
[0, 1]. Explicitly, our joint density has the form (up to a normalising constant),

$$ f(k, y) \propto \binom{N}{k} \, y^{k + \alpha - 1} (1 - y)^{N - k + \beta - 1}, \qquad (k, y) \in \mathbb{N} \times [0, 1]. \tag{E7} $$

Here $k$ takes values in the set $\{0, 1, 2, \ldots, N\}$, and $y \in [0, 1]$.

It is not immediately clear how one might sample from the non-trivial density
given at (E7). Further, it is also not a simple task to compute basic statistics such as the
mean and variance for the density at (E7). By inspection however, one might guess that
$k$ is some type of random variable with a binomial distribution and that $y$ is some sort of
Beta distributed random variable.

Suppose we assume $y$ is known and fixed, then

$$ \mathrm{prob}(k \mid y) \propto \binom{N}{k} \, y^{k} (1 - y)^{N - k} \tag{E8} $$

for all $k \in \{0, 1, \ldots, N\}$.

Similarly we suppose $k$ is fixed and known, then,

$$ \mathrm{prob}(y \mid k, \alpha, \beta) \propto y^{k + \alpha - 1} (1 - y)^{N - k + \beta - 1}. \tag{E9} $$

Here $\alpha > 0$ and $\beta > 0$. In other words, $k \mid y$ is Binomial$(N, y)$ and $y \mid k$ is Beta$(k + \alpha, N - k + \beta)$.

It can be shown that the probability density for $k$ alone has the following form,

$$ f(k) = \binom{N}{k} \, \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \, \frac{\Gamma(k + \alpha)\,\Gamma(N - k + \beta)}{\Gamma(\alpha + \beta + N)}. \tag{E10} $$

This exact probability density is useful to compare against the Gibbs Sampler.
Following the article [CG92], we consider an explicit scenario with $\alpha = 2$, $\beta = 4$ and
$N = 16$.
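For completeness, the marginal density (E10) follows from the joint density (E7) by integrating out $y$ and recognising a Beta integral. The short derivation below assumes that the normalising constant of the joint density is $\Gamma(\alpha + \beta) / (\Gamma(\alpha)\Gamma(\beta))$, which is the constant that makes (E10) sum to one over $k$.

```latex
\begin{aligned}
f(k) &= \int_0^1 \binom{N}{k} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}
        \, y^{k+\alpha-1} (1-y)^{N-k+\beta-1} \, dy \\
     &= \binom{N}{k} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}
        \int_0^1 y^{(k+\alpha)-1} (1-y)^{(N-k+\beta)-1} \, dy \\
     &= \binom{N}{k} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}
        \, \frac{\Gamma(k+\alpha)\,\Gamma(N-k+\beta)}{\Gamma(\alpha+\beta+N)} .
\end{aligned}
```

The last step uses the Beta integral $\int_0^1 y^{a-1}(1-y)^{b-1}\,dy = \Gamma(a)\Gamma(b)/\Gamma(a+b)$ with $a = k + \alpha$ and $b = N - k + \beta$. The resulting distribution is commonly called the beta-binomial distribution.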
Algorithm 3 Gibbs Sampling Algorithm

Choose a number of iterations K ∈ N, K > "burn in"
By some means choose an initial k ∈ {0, 1, 2, ..., 16}
Last-k is assigned the value k
for t = 1, 2, 3, ..., K do
    Sample y_t ~ Beta(Last-k + 2, 16 - Last-k + 4)
    Sample k_t ~ Binomial(16, y_t)
    Last-k is assigned the value k_t
    Store Gibbs-sequence value: θ_t = (y_t, k_t)
end for


The Gibbs sampling algorithm for the joint density (E7) is given in
algorithmic form at Algorithm 3. Our Gibbs sampler was run 500 times with a realisation
length of 2000 samples. At the conclusion of each of the 500 runs the final value for k was
stored. Using these k values an empirical estimate was computed for the probability
distribution. This estimate was then compared against the exact distribution function which
was computed via the density given at (E10). The estimated (cumulative) distribution
function was computed in Matlab with the Kaplan-Meier estimator (see [ABGK93, LK58]).
The results of this comparison are shown in Figure E1. Typical Gibbs sampler realisations
of (k, y) are also shown in Figure E1.
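The experiment described above can be sketched in Python using only the standard library. This is a minimal illustration rather than the Matlab code used for the report: the function names, the seed and the shorter realisation length (100 instead of 2000, to keep the run time small) are our own assumptions, and the comparison is made on the probability mass function rather than with the Kaplan-Meier estimator.

```python
import math
import random

N, ALPHA, BETA = 16, 2.0, 4.0            # scenario following [CG92]

def gibbs_chain(length, rng, k0):
    """Run one Gibbs chain from initial value k0; return the list of (k, y) pairs."""
    k, chain = k0, []
    for _ in range(length):
        y = rng.betavariate(k + ALPHA, N - k + BETA)      # y | k, as in (E9)
        k = sum(rng.random() < y for _ in range(N))       # k | y ~ Binomial(N, y), (E8)
        chain.append((k, y))
    return chain

def exact_pmf(k):
    """Exact marginal probability of k from (E10)."""
    log_f = (math.log(math.comb(N, k))
             + math.lgamma(ALPHA + BETA) - math.lgamma(ALPHA) - math.lgamma(BETA)
             + math.lgamma(k + ALPHA) + math.lgamma(N - k + BETA)
             - math.lgamma(ALPHA + BETA + N))
    return math.exp(log_f)

def empirical_pmf(runs, length, seed=0):
    """Estimate the distribution of k from the final value of each independent run."""
    rng = random.Random(seed)
    counts = [0] * (N + 1)
    for _ in range(runs):
        k_final, _ = gibbs_chain(length, rng, k0=rng.randrange(N + 1))[-1]
        counts[k_final] += 1
    return [c / runs for c in counts]

estimate = empirical_pmf(runs=500, length=100)
for k in range(N + 1):
    print(k, round(estimate[k], 3), round(exact_pmf(k), 3))
```

Keeping only the final value of each run, as in the text, gives approximately independent draws of k, at the cost of discarding most of each chain.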


Figure E1: Gibbs Sampler example for the bivariate density given at (E7).
(a) Typical realisations of θ = (k, y). (b) True and estimated distribution functions.
Appendix F Sample Elicited Text Data

This Appendix provides a sample of real elicited text data. This data set was collected at
the NICTA/DSTO Text collection day and is shown here. Each text entry is tagged with
a date stamp and time stamp and also with an index labelling a specific (but anonymous)
workshop attendee. This particular data set was generated in the session corresponding
to Table 2 of §??. The remaining two sessions from this workshop generated similar data
sets. The example below is included here to show a sampling of typical elicited text from
a collection of users. Note that some attendees use punctuation and some do not, some use
capitalization and some do not, etc.

o  1.1  Publications  and  Journal  Rankings 
Submitted  by  19  (2011-03-24  22:14:55) 

o  1.2  The  number  of  citations  both  primary  and  secondary  citations 
Submitted  by  14  (2011-03-24  22:15:15) 

o  1.3  understanding  /  interpretation  of  the  research 

Submitted  by  21  (2011-03-24  22:15:28) 

o  1.4  Against  the  standard  criteria  of  number  publications,  media  used  for  publishing, 
citations 

Submitted  by  2  (2011-03-24  22:15:28) 

o  1.5  Our  understanding  of  problems  improves  . 

Submitted  by  6  (2011-03-24  22:15:34) 

o  1.6  Several  ways  of  measuring  the  value  of  research  -  publications,  general  interest, 
applications  with  a  definable  benefit.  However,  in  my  opinion,  the  BEST  way  of 
measuring  the  value  of  research  is  not  what  it  solves,  but  the  number  of  new  questions 
it  opens  up  for  further  research 
Submitted  by  9  (2011-03-24  22:15:41) 

o  1.7  Depends  on  the  funding  source.  Value  to  university  is  rather  nebulous:  fame, 
useful  to  students,  publicity,  awards  gained,  funding  subsequently  gained. 
Submitted  by  4  (2011-03-24  22:15:41) 

o  1.8  number  of  citations 

Submitted  by  6  (2011-03-24  22:15:42) 

o  1.9  In  relation  to  applicability  and  generalisability  of  research  outcomes;  practical 
value  of  work 

Submitted  by  2  (2011-03-24  22:16:04) 

o 1.10 Research can be measured in a number of ways, both short-term and long-term.
Short-term measurements include number of publications, direct funding grants for
that research, patents granted and patents licensed, or external auditing. Long-term
research is probably best measure in terms of influence, that is the number
of citations that work has received after five or ten years (also residual long-term
technology and patent licensing).

Submitted  by  5  (2011-03-24  22:16:14) 
o 1.11 Measures should be quantifiable and meaningful.

Submitted  by  4  (2011-03-24  22:16:22) 

o  1.12  Research  is  best  measured  by  the  amount  of  subsequent  product  that  produced 
by  the  research  results. 

Submitted  by  17  (2011-03-24  22:16:30) 

o  1.13  Future  outcome  (money,  businesses,  renewal  of  science)  based  on  this  research 
Submitted  by  2  (2011-03-24  22:16:31) 

o  1.14  impact  on  the  direction  of  other  research  teams 

Submitted  by  14  (2011-03-24  22:16:53) 

o  1.15  Decisions  made  based  on  the  research.  Contributions  to  and  assisting  existing 
research.  Outcomes  in  the  form  of  new  inventions.  Adding  to  body  of  existing 
information.  Providing  an  understanding  of  a  particular  problem 
Submitted  by  11  (2011-03-24  22:16:59) 

o  1.16  Basic  and  applied  research  (and  the  many  shades  between)  require  different 
measurements  since  basic  research  tends  to  have  a  longer  time  before  it  has  impact 
whereas  applied  research  has  more  immediate  results. 

Submitted  by  1  (2011-03-24  22:17:09) 

o  1.17  improvements  in  quality  of  life  (medicine),  better  understanding  of  natural 
phenomena.  Reduces  the  influence  of  dogma. 

Submitted  by  6  (2011-03-24  22:17:12) 

o  1.18  The  value  of  research  cannot  be  measured.  If  I  knew  what  the  result  is,  it  is 
not  research. 

Submitted  by  17  (2011-03-24  22:17:22) 

o 1.19 I would tend to believe more strongly in long-term measurements than short-term
measurements, but funding agencies of course must rely on short-term measurements
except when funding people, who have an established track record.
Submitted  by  5  (2011-03-24  22:17:27) 

o  1.20  Relation  resources  put  to  the  research  and  value  got  from  it 
Submitted  by  2  (2011-03-24  22:17:28) 

o  1.21  Carefully. 

Submitted  by  1  (2011-03-24  22:17:30) 

o  1.22  Value  could  be  indirect,  like  Central  Limit  Theorem,  which  is  a  fundamental 
thing  used  elsewhere. 

Submitted  by  4  (2011-03-24  22:17:31) 

o  1.23  Depends  on  industry  and  comparing  timelines  of  similar  problems.  Is  there  a 
tangible  outcome  or  product  from  the  research.  Will  development  continue  when 
the  research  is  completed. 

Submitted  by  20  (2011-03-24  22:17:31) 

o  1.24  The  value  of  a  research  depends  on  who  is  measuring  it.  For  universities,  it  will 
depend  on  the  amount  of  citations  it  receives  and  publication  rankings 
Submitted  by  19  (2011-03-24  22:17:57) 